Methods, apparatus and systems for amplification-free DNA data storage

ABSTRACT

In various embodiments, amplification-free DNA information methods, apparatus and systems are disclosed. A method of amplification-free information storage and retrieval comprises encoding digital data such as binary into nucleotide sequence motifs using an encoding scheme, and synthesizing replicate DNA molecules using an amplification-free DNA writing process. The amplification-free process of decoding the information stored in the DNA comprises exposing at least one of the replicate DNA molecules to a molecular electronics sensor that generates distinguishable signals in a measured electrical parameter of the sensor, wherein the distinguishable signals correspond to the sequence motifs, providing decoding back to the digital data.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and the benefit of U.S. Provisional Patent Application Ser. No. 62/570,458, filed Oct. 10, 2017 and entitled “Methods, Apparatus and Systems for Amplification-Free DNA Data Storage,” the disclosure of which is incorporated herein by reference in its entirety.

FIELD

The present disclosure generally relates to electronic data storage and retrieval, and more particularly to an amplification-free DNA information storage and retrieval system for storing and retrieving digital data using DNA molecules.

BACKGROUND

The advent of digital computing in the 20th century created the need for archival storage of large amounts of digital or binary data. Archival storage is intended to house data for long periods of time, e.g., years, decades or longer, in a way that is very low cost, and that supports the rare need to re-access the data. Although an archival storage system may feature the ability to hold unlimited amounts of data at very low cost, such as through a physical storage medium able to remain dormant for long periods of time, the data writing and recovery in such a system can be relatively slow or otherwise costly processes. The dominant forms of archival digital data storage that have been developed to date include magnetic tape, and, more recently, compact optical disc (CD). However, as data production grows, there is a need for even higher density, lower cost, and longer lasting archival digital data storage systems.

It has been observed that in biology, the genomic DNA of a living organism functions as a form of digital information archival storage. On the timescale of the existence of a species, which may extend for thousands to millions of years, the genomic DNA in effect stores the genetic biological information that defines the species. The complex enzymatic, biochemical processes embodied in the biology, reproduction and survival of the species provide the means of writing, reading and maintaining this information archive. This observation has motivated the idea that perhaps the fundamental information storage capacity of DNA could be harnessed as the basis for high density, long duration archival storage of more general forms of digital information.

What makes DNA attractive for information storage is the extremely high information density resulting from molecular scale storage of information. In theory, for example, all human-produced digital information recorded to date, estimated to be approximately 1 ZB (zettabyte) (~10²¹ bytes), could be recorded in less than 10²² DNA bases, or 1/60th of a mole of DNA bases, which would have a mass of just 10 grams. In addition to high data density, DNA is also a very stable molecule, which can readily last for thousands of years without substantial damage, and which could potentially last far longer, for tens of thousands of years, or even millions of years, such as observed naturally with DNA frozen in permafrost or encased in amber.
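The density estimate above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: it assumes an ideal encoding of 2 bits per base with no redundancy and an average mass of roughly 330 g/mol per nucleotide in single-stranded DNA, neither of which is stated in this disclosure. Under those assumptions, 1 ZB falls within the stated bounds of 10²² bases, 1/60 of a mole, and 10 grams.

```python
AVOGADRO = 6.022e23            # bases per mole of bases
BITS_PER_BASE = 2              # assumption: ideal 4-letter alphabet, no redundancy
AVG_NT_MASS_G_PER_MOL = 330.0  # assumption: approximate mass of one ssDNA nucleotide


def dna_requirements(n_bytes: float) -> tuple[float, float, float]:
    """Return (number of bases, moles of bases, grams of single-stranded DNA)."""
    bases = n_bytes * 8 / BITS_PER_BASE
    moles = bases / AVOGADRO
    grams = moles * AVG_NT_MASS_G_PER_MOL
    return bases, moles, grams


# 1 ZB, taken as 1e21 bytes as in the text
bases, moles, grams = dna_requirements(1e21)
print(f"bases: {bases:.2e}, moles: {moles:.4f}, grams: {grams:.2f}")
```

Any practical encoding with error correction, addressing, or multi-base motifs would use more bases per bit, so the figures printed here are a lower bound under the stated assumptions.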

In spite of these attractions, using a single molecule of DNA for digital information storage and retrieval could be inefficient or even impossible due to the many sources of molecular structure errors in synthesizing a DNA molecule, loss/degradation of the molecule, and limits of signal detection from DNA sequencers used to sequence the molecule. Thus, it is frequently proposed that amplification be incorporated to provide many more molecules to engage in all these processes. However, amplification will add cost, time and operational complexity to the DNA information system. Therefore, what are needed are specific processes that individually or collectively remove the need for amplification steps in the various processes that comprise a DNA data storage system.

SUMMARY

In various embodiments, an amplification-free DNA information storage and retrieval system is disclosed. In various aspects, the system comprises a DNA reading device, a digital data encoding/decoding algorithm, and a DNA writing device, wherein the properties of these three elements are co-optimized to minimize or reduce various cost metrics and increase overall system performance. In various aspects, the co-optimization may comprise reducing the error rate of the system, through balancing, avoiding, or correcting the errors in DNA reading and writing. In other instances, the co-optimization may comprise reducing the DNA reading or writing time in the system, e.g., by avoiding the use of slower speed DNA sequence motifs, and/or by using error correction/avoidance to compensate for errors incurred from rapid operation of the system.

In various embodiments of the present disclosure, a method of archiving information is described. The method comprises: converting the information into one or more nucleotides using an encoding scheme, the nucleotides predetermined to generate distinguishable signals relating to the information in a measurable electrical parameter of a molecular electronics sensor; assembling the one or more nucleotides into a nucleotide sequence; and synthesizing a pool of replicate DNA molecules without amplification of the DNA molecules, wherein each replicate DNA molecule incorporates the nucleotide sequence.

In various embodiments, the information comprises a string of binary data.

In various embodiments, the encoding scheme converts one or more 0/1 bits of binary data within the string of binary data into a sequence motif comprising more than one nucleotide.

In various embodiments, the step of converting the information comprises dividing the string of binary data into segments, wherein each segment encodes one sequence motif.

In various embodiments, the binary data bit 0 encodes a homopolymer of A, and the binary data bit 1 encodes a homopolymer of C.
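As a minimal sketch of such a homopolymer scheme, the following maps each bit to a run of A or C and recovers the bits from the run lengths. The specific run lengths (AA for 0, CCC for 1) are an illustrative assumption borrowed from the motifs named in the description of FIG. 9, and the function names are hypothetical; the actual encoding schemes are those illustrated in FIG. 4.

```python
from itertools import groupby

# Assumed, illustrative run lengths: bit 0 -> "AA", bit 1 -> "CCC".
MOTIFS = {"0": "AA", "1": "CCC"}


def encode(bits: str) -> str:
    """Concatenate one homopolymer motif per input bit."""
    return "".join(MOTIFS[b] for b in bits)


def decode(seq: str) -> str:
    """Recover bits by splitting the sequence into homopolymer runs.

    A run of k motif-lengths of the same base decodes to k copies of the bit.
    """
    bit_and_unit = {motif[0]: (bit, len(motif)) for bit, motif in MOTIFS.items()}
    out = []
    for base, run in groupby(seq):
        if base not in bit_and_unit:
            raise ValueError(f"unexpected base {base!r}")
        bit, unit = bit_and_unit[base]
        n = len(list(run))
        if n % unit:
            raise ValueError(f"run of {n} x {base} is not a whole number of motifs")
        out.append(bit * (n // unit))
    return "".join(out)


print(encode("0110"))
print(decode(encode("0110")))
```

Because adjacent identical bits merge into one longer run, this string-level decoder depends on exact run lengths. A sensor-based reader would instead classify signal features, so this sketch only illustrates the bit-to-motif mapping, not the reading process.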

In various embodiments, one or more of the nucleotides comprises a modified nucleotide.

In various embodiments, the one or more nucleotides comprise nucleotides that are resistant to secondary structure formation in the replicate DNA molecules compared to a variant of the same nucleotides.

In various embodiments, the encoding scheme comprises any one or combination of BES1, BES2, BES3, BES4, BES5 and BES6 illustrated in FIG. 4.

In various embodiments, the method of archiving information further comprises: exposing at least one of the replicate DNA molecules to the molecular electronics sensor without prior amplification of the DNA molecules; generating the distinguishable signals; and converting the distinguishable signals into the information, wherein the molecular electronics sensor comprises a pair of spaced-apart electrodes and a molecular sensor complex attached to each electrode to form a sensor circuit, wherein the molecular sensor complex comprises a bridge molecule electrically wired to each electrode in the pair of spaced-apart electrodes and a probe molecule conjugated to the bridge molecule.

In various embodiments, the step of exposing at least one of the replicate DNA molecules to the molecular electronics sensor comprises suspending the pool of DNA molecules in a buffer, taking an aliquot of the buffer, and providing the aliquot to the sensor.

In various embodiments, the buffer solution comprises modified dNTPs.

In various embodiments, the measurable electrical parameter of the sensor comprises a source-drain current between the spaced-apart electrodes and through the molecular sensor complex.

In various embodiments, the probe molecule for the sensor comprises a polymerase and the measurable electrical parameter of the sensor is modulated by enzymatic activity of the polymerase while processing any one of the replicate DNA molecules.

In various embodiments, the polymerase comprises the Klenow Fragment of E. coli Polymerase I, and the bridge molecule comprises a double-stranded DNA molecule.

In various embodiments of the present disclosure, a method of archiving and retrieving a string of binary data in an amplification-free DNA information storage and retrieval system is described. The method comprises: dividing the string of binary data into segments of at least one binary bit; assigning each segment to a sequence motif, each sequence motif comprising at least two nucleotides, the sequence motifs predetermined to generate distinguishable signals in a measurable electrical parameter of a molecular electronics sensor; assembling the sequence motifs into a nucleotide sequence; synthesizing a pool of replicate DNA molecules using an amplification-free DNA writing method on a solid support, each replicate DNA molecule incorporating the nucleotide sequence; suspending the pool of DNA molecules in a buffer; taking an aliquot of the buffer; providing the aliquot to the sensor without prior amplification of the DNA molecules; generating the distinguishable signals; and converting the distinguishable signals into the string of binary data, wherein the sensor comprises a pair of spaced-apart electrodes and a molecular sensor complex attached to each electrode to form a molecular electronics circuit, wherein the molecular sensor complex comprises a bridge molecule electrically wired to each electrode in the pair of spaced-apart electrodes and a probe molecule conjugated to the bridge molecule.

BRIEF DESCRIPTION OF THE DRAWING FIGURES

FIG. 1 illustrates an embodiment of an amplification-free DNA digital data storage system;

FIG. 2 illustrates a general process flow of data input and retrieval for an embodiment of an amplification-free DNA digital data storage system;

FIG. 3 illustrates a constrained process flow of data input and retrieval for an embodiment of an amplification-free DNA digital data storage system;

FIG. 4 illustrates embodiments of binary data encoding schemes (BES) for use with DNA;

FIG. 5 illustrates examples of DNA physical and logical structures for DNA molecules used for digital data storage;

FIG. 6 illustrates embodiments of various DNA synthesis processes capable of producing many distinct sequences within molecular replicates of each sequence, wherein N sequences with M synthesis start sites are provided per sequence;

FIG. 7 illustrates an embodiment of a molecular electronic sensing circuit in which a bridge molecule completes an electrical circuit and an electrical circuit parameter is measured versus time, wherein variations in the measured parameter comprise signals corresponding to interactions of the bridge molecule with other interacting molecules in the environment;

FIG. 8 illustrates an embodiment of a polymerase-based molecular sensor usable as a reader of data encoded into synthetic DNA molecules. A sensor comprising a polymerase produces distinguishable signals in a monitored electrical parameter corresponding to distinct DNA molecular features, wherein such features can be used to encode information into synthetic DNA molecules, which in turn can be read by the sensor;

FIG. 9 illustrates an embodiment of a polymerase-based molecular sensor, wherein the polymerase molecule is attached to a bridge molecule connecting source and drain electrodes, and wherein two different sequence motifs AA and CCC produce two distinguishable signals in the monitored electrical parameter of the sensor;

FIG. 10 illustrates a detailed protein structure of the Klenow Fragment of E. coli Polymerase I, a specific polymerase molecule of use herein in a DNA reading device;

FIG. 11 illustrates an embodiment of a molecular electronic sensor wherein a polymerase is conjugated from a specific position on the polymerase to a bridge molecule that connects between electrodes;

FIG. 12 illustrates a specific embodiment of the molecular sensor of FIG. 11, wherein the bridge molecule comprises double-stranded DNA, the polymer-bridge conjugation comprises biotin-streptavidin binding, and wherein the electrodes comprise gold-on-titanium in order to support thiol-gold bonding from the bridge molecule to each of the electrodes;

FIG. 13 illustrates an embodiment of a nano-electrode test chip and test set-up used in basic sensor experiments herein;

FIGS. 14A, 14B, and 14C illustrate an embodiment of metal nano-electrodes further comprising gold nano-dot contacts, which was used in various sensor experiments;

FIG. 15 illustrates experimental current traces obtained by processing homopolymer sequences of A, T, C and G;

FIG. 16 illustrates experimental data produced by the sensor of FIG. 12 in which specific sequence motifs of poly-A and poly-C are shown to produce distinguishable signals, demonstrating potential to encode binary data;

FIG. 17 illustrates a relationship between physical DNA structure and the logical structure of a DNA data storage molecule that comprises suitable adaptors, a primer segment, buffer segments, and a data payload segment;

FIG. 18 illustrates an embodiment of a fabrication stack used to place DNA reader sensors on a chip for a low-cost massively parallel configuration;

FIG. 19 illustrates conceptual architecture for a chip-based array of polymerase sensors and an exemplary single pixel circuit comprising a trans-impedance amplifier;

FIG. 20 illustrates an embodiment of a completed, annotated chip design, and an optical microscope image of the fabricated chip, for an embodiment of the pixel array chip of FIG. 19, having an array of 256 pixels;

FIG. 21 illustrates a representation of an electron microscope image of the fabricated chip of FIG. 20, including insets of the nano-electrode with a polymerase molecular complex in position;

FIG. 22 illustrates a schematic of an embodiment of a complete system for reading DNA data with chip-based DNA reader sensors, which is free of amplification;

FIG. 23 illustrates a schematic of an embodiment of a cloud-based DNA data archival storage system in which a multiplicity of the DNA reading system of FIG. 22 are aggregated to provide the data reader server, and in which the system is free of amplification;

FIG. 24 illustrates an alternative embodiment of a DNA data reader sensor comprising a nanopore ion current sensor that produces distinguishable signal features in the nanopore ion current while processing DNA;

FIG. 25 illustrates an embodiment of a DNA data reader sensor comprising a polymerase complexed to a carbon nanotube molecular wire spanning positive and negative electrodes, which produces distinguishable signal features in the measured current passing through the carbon nanotube; and

FIG. 26 illustrates an embodiment of a zero mode waveguide sensor complexed with a single polymerase, shown in cross section, which produces distinguishable optical signals corresponding to DNA features.

DETAILED DESCRIPTION

In various embodiments, methods, apparatus and systems are disclosed that utilize DNA molecules as a general purpose means of digital information storage without amplification of the DNA. In various aspects, the physical DNA does not require amplification in any aspect of the entire information storage system or in any specific subsystems of the information storage system herein. Amplification processes can impose burdens of cost, time, complexity, performance variability, and other limitations to a DNA information storage system. Further, amplification is incompatible with DNA comprising modified bases. Therefore, methods, apparatus, and systems for DNA information storage in accordance with the present disclosure are configured to avoid DNA amplification.

In various embodiments, a DNA data storage system utilizing DNA molecules as a general purpose means of digital information storage is disclosed. In certain aspects, a system for digital information storage comprises a DNA reading device, an information encoder/decoder algorithm, and a DNA writing device. In various aspects, the system further comprises a subsystem for managing physical DNA molecules to support data archival operations. The interrelation of these elements and their co-optimization are disclosed.

In various embodiments, a data reader for a DNA data storage system is disclosed. In various aspects, a DNA reading device comprises a sensor that extracts information from a single DNA molecule, thus not requiring DNA amplification. The sensor may be deployed in a chip-based format. In various examples, data reading systems that support such a chip-based sensor device are disclosed.

Definitions

As used herein, the term “DNA” may refer not only to a biological DNA molecule, but also to fully synthetic versions, made by various methods of synthetic chemistry, such as nucleotide phosphoramidite chemistry, or by serial ligation of DNA oligomers, and also to forms made with chemical modifications present on the bases, sugar, or backbone, of which many are known to those skilled in nucleic acid biochemistry, including methylated bases, adenylated bases, other epigenetically marked bases, or also including non-standard or universal bases, such as inosine or 3-nitropyrrole, or other nucleotide analogues, or ribobases, or abasic sites, or damaged sites, and also including such DNA analogues as Peptide Nucleic Acids (PNA), Locked Nucleic Acids (LNA), Xeno Nucleic Acids (XNA) (a family of sugar-modified forms of DNA, including Hexitol Nucleic Acid (HNA)), Glycol Nucleic Acid (GNA), etc., and also including the biochemically similar RNA molecule along with synthetic RNA and modified forms of RNA. All these biochemically closely related forms are implied by the use of the term DNA, in the context of referring to the data storage molecule used in a DNA storage system, including a template single strand, a single strand with oligomers bound thereon, double-stranded DNA, and double strands with bound groups such as groups to modify various bases. In addition, as used herein, the term DNA may refer to the single-stranded forms of such molecules, as well as double helix or double-stranded forms, including hybrid duplex forms, including forms that contain mismatched or non-standard base pairings, or non-standard helical forms such as triplex forms, as well as molecules that are partially double-stranded, such as a single-stranded DNA bound to an oligonucleotide primer, or a molecule with a hairpin secondary structure. In various embodiments, DNA refers to a molecule comprising a single-stranded DNA component having bound oligonucleotide segments and/or perturbing groups that can act as the substrate for a probe molecule, such as a polymerase, to process, and in doing so, generate distinguishable signals in a monitored electrical parameter of a molecular sensor.

DNA sequences as written herein, such as GATTACA, refer to DNA in the 5′ to 3′ orientation, unless specified otherwise. For example, GATTACA as written herein represents the single-stranded DNA molecule 5′-G-A-T-T-A-C-A-3′. In general, the convention used herein follows the standard convention for written DNA sequences used in the field of molecular biology.

As used herein, the term “oligonucleotide” or “binding oligonucleotide” refers to a short segment of DNA, or analog forms described above, having a length in the range of 3 to 100 bases, or 5 to 40 bases, or 10 to 30 bases, which can hybridize with a complementary sequence contained in a template strand. Such hybridization may be through perfect Watson-Crick base-pairing matches, or may involve mismatches or nonstandard base pairings.

As used herein, the term “probe molecule” refers to a molecule electrically wired between two electrodes in a pair of spaced-apart electrodes in a molecular sensor, capable of interacting with molecules in the environment around the sensor to provide perturbations in a monitored electrical parameter of the molecular sensor relating to the molecular interactions. A probe molecule herein may comprise a polymerase molecule, or any other processive enzyme such as a helicase or exonuclease. In a molecular sensor herein used as a DNA reading device, a probe molecule may be conjugated to a bridge molecule that is directly wired across two spaced-apart electrodes in a pair of electrodes by direct bonds between the bridge molecule and the electrodes.

As used herein, the term “polymerase” refers to an enzyme that catalyzes the formation of a nucleotide chain by incorporating DNA or DNA analogues, or RNA or RNA analogues, against a template DNA or RNA strand. The term polymerase includes, but is not limited to, wild-type and mutant forms of DNA polymerases, such as Klenow, E. coli Pol I, Bst, Taq, Phi29, and T7, wild-type and mutant forms of RNA polymerases, such as T7 and RNA Pol I, and wild-type and mutant reverse transcriptases that operate on an RNA template to produce DNA, such as AMV and MMLV. A polymerase is a choice for a probe molecule in a molecular sensor herein usable as a DNA reader.

As used herein, a “bridge molecule” refers to a molecule bound between two spaced-apart electrodes in a pair of electrodes, to span the electrode gap between the two and complete an electrical circuit of a molecular sensor. In various embodiments, a bridge molecule has roughly the same length as the electrode gap, such as 1 nm to 100 nm, or in some cases, about 10 nm. Bridge molecules for use herein may comprise double-stranded DNA, other analog DNA duplex structures, such as DNA-RNA, DNA-PNA or DNA-LNA or DNA-XNA duplex hybrids, peptides, protein alpha-helix structures, antibodies or antibody Fab domains, graphene nanoribbons or carbon nanotubes, silicon nanowires, or any other of a wide array of molecular wires or conducting molecules known to those skilled in the art of molecular electronics. A bridge molecule herein may be described as having a “first” and “second” end, such as a base at or near the 3′ end and a base at or near the 5′ end of a DNA molecule acting as a bridge molecule. For example, each end may be chemically modified such that the first end of a bridge molecule bonds to a first electrode and the second end of a bridge molecule bonds to a second electrode in a pair of spaced-apart electrodes. This nomenclature aids in visualizing a bridge molecule spanning an electrode gap and bonding to each electrode in a pair of spaced-apart electrodes. In various embodiments, the first and second ends of a bridge molecule may be chemically modified so as to provide for self-assembly between the bridge molecule and a probe molecule such as a polymerase, and/or between the bridge molecule and one or both electrodes in a pair of electrodes. In a non-limiting example, the ends of a bridge molecule are bonded to each of two electrodes in a pair of spaced-apart electrodes by thiol (—SH)-gold bonds.

As used herein, the term “sensor molecular complex” or “sensor probe complex” refers to the combination of a probe molecule and a bridge molecule, with the two molecules conjugated together, and the assemblage wired into the sensor circuit, or any combination of more than two molecules that together are wired into the sensor circuit.

As used herein, the term “dNTP” refers to both the standard, naturally occurring nucleoside triphosphates used in biosynthesis of DNA (i.e., dATP, dCTP, dGTP, and dTTP), and natural or synthetic analogues or modified forms of these, including those that carry base modifications, sugar modifications, or phosphate group modifications, such as an alpha-thiol modification or gamma phosphate modifications, or the tetra-, penta-, hexa- or longer phosphate chain forms, or any of the aforementioned with additional groups conjugated to any of the phosphates, such as the beta, gamma or higher order phosphates in the chain. In general, as used herein, “dNTP” refers to any nucleoside triphosphate analogue or modified form that can be incorporated by a polymerase enzyme as it extends a primer, or that would enter the active pocket of such an enzyme and engage transiently as a trial candidate for incorporation.

As used herein, “buffer,” “buffer solution” and “reagent solution” refer to a solution which provides an environment in which a molecular sensor can operate and produce signals from supplied DNA templates. In various embodiments, the solution is an aqueous solution, which may comprise dissolved, suspended or emulsified components such as salts, pH buffers, divalent cations, surfactants, blocking agents, solvents, template primer oligonucleotides, other proteins that complex with the polymerase of the sensor, and also possibly including the polymerase substrates, i.e., dNTPs, analogues or modified forms of dNTPs, and DNA molecule substrates or templates. In non-limiting examples, a buffer is used to hydrate and suspend DNA molecules that have been left in a lyophilized state in a DNA information library, in order to provide the DNA to a DNA reader for decoding of stored information.

As used herein, “binary data” or “digital data” refers to data encoded using the standard binary code, or a base 2 {0,1} alphabet, data encoded using a hexadecimal base 16 alphabet, data encoded using the base 10 {0-9} alphabet, data encoded using ASCII characters, or data encoded using any other discrete alphabet of symbols or characters in a linear encoding fashion.

As used herein, “digital data encoded format” refers to a series of binary digits, or other symbolic digits or characters that come from the primary translation of DNA sequence features used to encode information in DNA, or the equivalent logical string of such classified DNA features. In some embodiments, information to be archived as DNA may be translated into binary data, or may exist initially as binary data, and then this data may be further encoded with error correction and assembly information, into the format that is directly translated into the code provided by the distinguishable DNA sequence features. This latter association is the primary encoding format of the information. Application of the assembly and error correction procedures is a further, secondary level of decoding, back towards recovering the source information.

As used herein, “distinguishable DNA sequence features” means those features of a data-encoding DNA molecule that, when processed by a molecular sensor, such as one comprising a polymerase, produce distinct signals corresponding to the encoded information. Such features may be, for example, different bases, different modified bases or base analogues, different sequences or sequence motifs, or combinations of such to achieve features that produce distinguishable signals when processed by a sensor polymerase.

As used herein, a “DNA sequence motif” refers to either a specific letter (base) sequence, or a pattern, representing any member of a specific set of such letter sequences. For example, the following are sequence motifs that are specific letter sequences: GATTACA, TAC, or C. In contrast, the following are sequence motifs that are patterns: G[A/T]A is a pattern representing the explicit set of sequences {GAA, GTA}, and G[2-5] is a pattern referring to the set of sequences {GG, GGG, GGGG, GGGGG}. The explicit set of sequences is the unambiguous description of the motif, while pattern shorthand notations such as these are common compact ways of describing such sets. Motif sequences such as these may be describing native DNA bases, or may be describing modified bases, in various contexts. In various contexts, the motif sequences may be describing the sequence of a template DNA molecule, and/or may be describing the sequence on the molecule that complements the template.
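The relationship between a pattern and its explicit set of sequences can be sketched as a small expander for the two shorthand forms used above. This is an illustrative reading of the notation only: the function name `expand`, the restriction to the letters A, C, G and T, and the token grammar are assumptions, not part of the disclosure.

```python
import re
from itertools import product

# Tokens, tried in order: a base with a repeat range (G[2-5]),
# a bracketed set of alternatives ([A/T]), or a single literal base.
TOKEN = re.compile(r"([ACGT])\[(\d+)-(\d+)\]|\[([ACGT](?:/[ACGT])*)\]|([ACGT])")


def expand(pattern: str) -> list[str]:
    """Return the explicit list of sequences that a motif pattern represents."""
    choices = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        if m.start() != pos:  # finditer skipped characters it could not parse
            raise ValueError(f"cannot parse {pattern!r} at position {pos}")
        pos = m.end()
        base, lo, hi, alts, literal = m.groups()
        if base:
            choices.append([base * n for n in range(int(lo), int(hi) + 1)])
        elif alts:
            choices.append(alts.split("/"))
        else:
            choices.append([literal])
    if pos != len(pattern):
        raise ValueError(f"cannot parse {pattern!r} at position {pos}")
    return ["".join(parts) for parts in product(*choices)]


print(expand("G[A/T]A"))
print(expand("G[2-5]"))
```

A specific letter sequence such as TAC expands to a set containing only itself, which matches the statement that the explicit set is the unambiguous description of a motif.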

As used herein, “sequence motifs with distinguishable signals,” in the cases of patterns, means that there is a first motif pattern representing a first set of explicit sequences, and any of said sequences produces the first signal, and there is a second motif pattern representing a second set of explicit sequences, and any of said sequences produces the second signal, and the first signal is distinguishable from the second signal. For example, if motif G[A/T]A and motif G[3-5] produce distinguishable signals, it means that any of the set {GAA, GTA} produces a first signal, and any of the set {GGG, GGGG, GGGGG} produces a second signal that is distinguishable from the first signal.

As used herein, “distinguishable signals” refers to one electrical signal from a sensor being discernably different than another electrical signal from the sensor, either quantitatively (e.g., peak amplitude, signal duration, and the like) or qualitatively (e.g., peak shape, and the like), such that the difference can be leveraged for a particular use. In a non-limiting example, two current peaks versus time from an operating molecular sensor are distinguishable if there is more than about a 1×10⁻¹⁰ Amp difference in their amplitudes. This difference is sufficient to use the two peaks as two distinct binary bit readouts, e.g., a 0 and a 1. In some instances, a first peak may have a positive amplitude, e.g., from about 1×10⁻¹⁰ Amp to about 20×10⁻¹⁰ Amp amplitude, whereas a second peak may have a negative amplitude, e.g., from about 0 Amp to about −5×10⁻¹⁰ Amp amplitude, making these peaks discernably different and usable to encode different binary bits, i.e., 0 or 1.
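The amplitude criterion in the non-limiting example above can be sketched as follows. Only the threshold of about 1×10⁻¹⁰ Amp comes from the text. The nearest-reference decision rule, the function names, and the sample amplitudes are hypothetical and are used purely to illustrate how two distinguishable peak levels could be read out as bits.

```python
MIN_SEPARATION_A = 1e-10  # from the text: about 1e-10 Amp amplitude difference


def distinguishable(amp_a: float, amp_b: float,
                    min_sep: float = MIN_SEPARATION_A) -> bool:
    """True if two peak amplitudes (in Amps) differ by more than min_sep."""
    return abs(amp_a - amp_b) > min_sep


def call_bit(peak_amp: float, ref0: float, ref1: float) -> int:
    """Assign a peak to bit 0 or 1 by nearest reference amplitude.

    Hypothetical decision rule: ref0 and ref1 are the expected amplitudes
    of the two motif signals and must themselves be distinguishable.
    """
    if not distinguishable(ref0, ref1):
        raise ValueError("reference amplitudes are not distinguishable")
    return 0 if abs(peak_amp - ref0) <= abs(peak_amp - ref1) else 1


# Illustrative values only: a positive-going peak for 0, a negative-going peak for 1.
REF0, REF1 = 5e-10, -2e-10
peaks = [4e-10, -1e-10, 6e-10, -3e-10]
print([call_bit(p, REF0, REF1) for p in peaks])
```

A practical reader would likely combine amplitude with the other quantitative and qualitative features mentioned above, such as signal duration and peak shape, rather than rely on a single threshold.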

As used herein, a “data-encoding DNA molecule,” or “DNA data encoding molecule,” refers to a DNA molecule synthesized to encode data within the DNA's molecular structure, which can be retrieved at a later time.

As used herein, “reading data from DNA” refers to any method of measuring distinguishable events, such as electrical signals or other perturbations in a monitored electrical parameter of a circuit, which correspond to molecular features in a synthetic DNA molecule that were built into the synthetic DNA to encode information into the DNA molecule.

As used herein, “electrodes” refer to nano-scale electrical conductors (more simply, “nano-electrodes”), disposed in pairs and spaced apart by a nanoscale-sized electrode gap between the two electrodes in any pair of electrodes. In various embodiments, the term “electrode” may refer to a source, drain or gate. A gate electrode may be capacitively coupled to the gap region between source and drain electrodes, and comprise a “buried gate,” “back gate,” or “side gate.” The electrodes in a pair of spaced-apart electrodes may be referred to specifically (and labeled as such in various drawing figures) as the “source” and “drain” electrodes, “positive” and “negative” electrodes, or “first” and “second” electrodes. Whenever electrodes in any of the drawing figures herein are labeled “positive electrode” and “negative electrode,” it should be understood the polarity indicated may be reversed (i.e., the labels of these two elements in the drawings can be reversed), unless indicated otherwise (such as an embodiment where electrons may be flowing to a negative electrode). Nano-scale electrodes in a pair of electrodes are spaced apart by an electrode gap measuring about 1 nm to 100 nm, and each electrode may have other critical dimensions, such as width, height, and length, also in this same nanoscale range. Such nano-electrodes may be composed of a variety of materials that provide conductivity and mechanical stability. They may be comprised of metals, or semiconductors, for example, or of a combination of such materials. Metal electrodes may comprise, for example, titanium, chromium, platinum, or palladium. Pairs of spaced-apart electrodes may be disposed on a substrate by nano-scale lithographic techniques.

As used herein, the term “conjugation” refers to a chemical linkage (i.e., bond) of any type known in the chemical arts, e.g., covalent, ionic, Van der Waals, etc. The conjugations of a probe molecule, such as a polymerase, to a bridge molecule, such as a double-stranded DNA molecule, or conjugations between a bridge molecule and an electrode or a metal deposit on an electrode, may be accomplished by a diverse array of conjugation methods known to those skilled in the art of conjugation chemistry, such as biotin-avidin couplings, thiol-gold couplings, cysteine-maleimide couplings, gold binding peptides or material binding peptides, click chemistry coupling, Spy-SpyCatcher protein interaction coupling, or antibody-antigen binding (such as the FLAG peptide tag/anti-FLAG antibody system), and the like. Conjugation of a probe molecule to each electrode in a pair of spaced-apart electrodes comprises an “electrical connection” or the “electrical wiring” of the probe molecule into a circuit that includes the probe molecule and the pair of electrodes. In other words, the probe molecule is conjugated to each electrode in a pair of electrodes to provide a conductive pathway between the electrodes that would otherwise be insulated from one another by the electrode gap separating them. A conductive pathway is provided by electron delocalization/movement through the chemical bonds of the probe molecule, such as through C—C bonds. Conjugation sites engineered into a probe molecule, such as a polymerase, by recombinant methods or methods of synthetic biology, may in various embodiments comprise any one of a cysteine, an aldehyde tag site (e.g., the peptide motif CxPxR), a tetracysteine motif (e.g., the peptide motif CCPGCC), and an unnatural or non-standard amino acid (NSAA) site, such as through the use of an expanded genetic code to introduce a p-acetylphenylalanine, or an unnatural crosslinkable amino acid, such as through the use of RNA- or DNA-protein cross-link using 5-bromouridine (see Gott, J. M., et al., Biochemistry, 30 (25), 6290-6295 (1991)).

As used herein, the term “amplification” refers to molecular biology methods that make one or more copies of a DNA molecule, and that, when performed on a pool of suitable DNA molecules, collectively achieve copying of the pool. Such copying methods include converting a DNA molecule to RNA, or vice-versa. Such methods include all forms of exponential copying, such as PCR, in which the number of copies produced from an initial set of templates grows exponentially with cycle number or time. This includes the many variants or extensions of PCR known to those skilled in molecular biology. These include, for example, isothermal methods and rolling circle methods and methods that rely on nicking or recombinase to create priming sites, or amplification methods such as LAMP, DMA or RPA. Such methods also include linear amplification methods, in which the number of copies produced grows linearly with cycle number or time, such as T7 amplification or using a single primer with thermocycling or with isothermal means of reinitiating polymerase extension of a primer. This includes use of degenerate primers or random primers. Amplification as used herein also explicitly includes the special case of creation of the complementary strand of a single-stranded template, when such complementary strand is also used to represent the stored information in the processes of data storage, for example, in the context of readers that read both strands in the process of recovering the stored information. Such a complementary strand may remain in the double-stranded physical conformation with its complement in the storage DNA molecule, or may exist separated from its complementary strand in the storage system; in either case this constitutes amplification, i.e., copying, of the primary template for information storage purposes. This is distinguished from the case where a DNA reader creates a complementary strand in the course of reading data from a single-stranded template, which is not amplification as the term is used herein, because in this case such a strand is merely a byproduct of the reading process and is not itself used as an information encoding molecule from which information is potentially extracted. DNA readers that create such byproduct strands include the polymerase molecular electronics readers described herein and illustrated in FIGS. 8-12 and 24-26. As used herein, such DNA reader systems are amplification-free.

Amplification-Free DNA Digital Data Storage:

General aspects of amplification-free DNA data storage methods, apparatus and systems, in accordance with the present disclosure and usable for archiving and later accessing stored data, are disclosed in reference to the various drawing figures:

FIG. 1 illustrates an embodiment of an amplification-free DNA information storage system in accordance with the present disclosure. As illustrated in FIG. 1, an amplification-free DNA storage system comprises an information encoder/decoder algorithm, a DNA writing device (synthesizer), a DNA reading device (sequencer), and a library management subsystem for managing physical DNA molecules in the library physical storage to support archival operations. This example shows the major elements of a DNA storage system, including the physical system that is used to handle and maintain the DNA material during storage and that carries out operations on the stored archive, such as copying. An external computer provides high-level control of the system, supplying information for storage and receiving extracted information. Information is encoded as DNA sequences, synthesized into DNA molecules, stored, and then read, decoded and output. In addition, such a system is capable of physical I/O of the DNA archive material samples as well.

FIG. 2 illustrates primary DNA storage system information phases and processes, including the major phases of information existing in the overall system (depicted in FIG. 2 as boxes), along with the primary operations transitioning from one form to another (depicted in FIG. 2 by arrows). As shown in FIG. 2, the elements of writing, reading and library management each comprise steps of physically processing DNA molecules.

As indicated in FIG. 3, one aspect of the present system is that these processes (of FIG. 2) are free of any amplification of the stored DNA. That is, the DNA storage system is amplification-free. As discussed herein in more detail, various methods and apparatus provide for the amplification-free elements depicted in FIG. 3. In various embodiments, the DNA storage system is entirely amplification-free. That is, all of the processes illustrated in FIG. 2 are amplification-free. In other embodiments, a DNA storage system in accordance with the present disclosure may comprise an amplification-free element, but the system may not be entirely amplification-free overall. Nonetheless, there are still separate benefits provided by the amplification-free elements within a system that is not entirely amplification-free. Such a system that is not entirely amplification-free, but that comprises at least one amplification-free process, is referred to herein as a “reduced-amplification” DNA information storage system.

Each major element of a DNA data storage system in accordance with the present disclosure is detailed herein below, including how each element of the system relates to, or involves, DNA amplification, and how the relevant amplification-free elements can be configured for a DNA data storage system.

In various aspects of the present disclosure, a DNA information storage system comprises: an encoder/decoder; a DNA writing device; and a DNA reading device.

Encoder/Decoder:

In various aspects, the encoder/decoder provides two functions. First, the encoder portion translates given digital/binary information or data into a specific set of DNA sequence data that are inputs to the DNA writer. Second, the decoder portion translates a given set of DNA sequences of the type provided by the DNA reader back into digital information.

FIG. 4 illustrates several binary encoding schemes (herein referred toas “BES”) for converting binary data into DNA sequences. Other encodingschemes of use herein include those schemes capable of supporting errordetection and error correction. In these example encoding scenarios, theoriginating digital data that is to be stored as DNA will typicallyoriginate as electronic binary data. In various examples, information(e.g., language, music, etc.) can first be converted to electronicbinary data. This originating binary data will then be divided intosegments, augmented by reassembly data, and transformed by errorcorrecting encodings appropriate for DNA data storage to produce actualbinary data payload segments, (such as exemplified in FIG. 4), whichthen require translation to DNA sequences for subsequent DNA synthesisto produce the physical storage molecules. As will be discussed in moredetail below, a DNA logical structure comprises a data payload segmentwherein specific data is encoded. In various embodiments, a data payloadsegment comprises the actual primary digital data being stored alongwith metadata for the storage method, which may comprise data related toproper assembly of such information fragments into longer strings,and/or data related to error detection and correction, such as paritybits, check sums, or other such information overhead.
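The segmentation step described above can be sketched in a few lines. This is a minimal illustration only, assuming a fixed segment length, a fixed-width binary index as the "reassembly data," and a single even-parity bit as the error-detection overhead; the function names, field widths and zero-padding rule are hypothetical choices and are not taken from the disclosure.

```python
from typing import List


def segment_payload(bits: str, segment_len: int = 8, index_bits: int = 4) -> List[str]:
    """Split a binary string into indexed segments, each ending in a parity bit.

    Layout of each returned segment (all assumed for illustration):
    [index_bits of segment index][segment_len data bits][1 even-parity bit]
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("payload must contain only 0 and 1")
    n_segments = -(-len(bits) // segment_len)  # ceiling division
    if n_segments > 2 ** index_bits:
        raise ValueError("too many segments for the index field")
    segments = []
    for k in range(n_segments):
        # The last chunk is zero-padded; a real scheme would also record the true length.
        chunk = bits[k * segment_len:(k + 1) * segment_len].ljust(segment_len, "0")
        body = format(k, "0{}b".format(index_bits)) + chunk
        parity = str(body.count("1") % 2)  # makes the total number of 1s even
        segments.append(body + parity)
    return segments


def parity_ok(segment: str) -> bool:
    """Return True when the segment still has an even number of 1s."""
    return segment.count("1") % 2 == 0


word = "00101001100111001111101000101101"  # the 32-bit example word used with FIG. 4
payload_segments = segment_payload(word)
```

Each resulting binary payload segment would then be handed to a primary encoding such as a BES. A single parity bit only detects an odd number of bit flips; an error-correcting code would be substituted where correction is needed.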

Primary translation from binary to DNA sequence is, in various embodiments, performed by binary encoding schemes (BES), such as those exemplified in FIG. 4. These encoding schemes provide primary translation from a digital data format, such as a binary data format, to a DNA molecular sequence format, by first producing a list of distinguishable signaling features that imply corresponding DNA segments, which are assembled for the encoding DNA molecule. Choosing which BES is appropriate depends, in part, on the type of distinguishable signal features and their arrangements (as discussed below in the context of FIG. 8 and illustrated in the inset of FIG. 8). As discussed herein, BES that convert one bit of digital binary data to more than one DNA base are preferred, at least for the sake of reducing errors that would otherwise occur in both DNA writing and DNA reading.

FIG. 4 illustrates several such primary encodings, beginning with an exemplary binary data payload, a particular 32-bit word “00101001100111001111101000101101,” and converting the binary data to one or more distinguishable signal features for the encoded DNA molecule. As illustrated in FIG. 4, BES1 is the encoding of 2 bits of data into 1 DNA letter, e.g., encoding 1 bit into 2 distinguishable signaling features F1 and F2, for use with a DNA reading sensor that can distinguish these features; BES2 is the encoding of two binary digits into two bases (one DNA letter per one binary bit), e.g., encoding combinations of two binary bits 00, 01, 10 and 11 into four features, F1, F2, F3 and F4, for use with a DNA reading sensor that distinguishes these features; and BES3 shows an example where the length of the sequence tract is used to encode information, which is appropriate for cases where such sequence runs produce distinguishable signals from the DNA reading sensor. BES3 encodes two binary digits into two runs of bases, AA and CCC (one run of DNA letters per one binary bit). For example, BES3 encodes the binary strings 0, 1 and 00 into 3 distinguishable features F1, F2 and F3. BES4 and BES5 illustrate the possible use of additional DNA bases such as modified bases or base analogues, denoted here as X, Y, Z, and W. If such analogues produce distinguishable signals from the DNA reader, this can be used as in BES4 to implement a binary code using two such distinguishable analogues. BES5 uses a total of 8 distinguishable DNA letters to represent 3 bits of binary data, using DNA molecules composed of 4 native bases and 4 modified bases, to encode the eight possible 1/0 3-bit states (one DNA base or modified base per 3 bits of data). BES6 illustrates the use of sequence motifs to encode binary information (one DNA sequence motif per one binary bit), in such a case where such motifs produce distinguishable signal traces.
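As a concrete illustration of a fixed-length primary encoding, the sketch below packs two bits into one DNA letter and reverses the mapping on decode. The particular assignment of bit pairs to bases (00 to A, 01 to C, 10 to G, 11 to T) is an assumption made for this example; the table actually drawn in FIG. 4 is not reproduced in the text and may differ.

```python
# Assumed bit-pair-to-base table, for illustration only.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: pair for pair, base in BITS_TO_BASE.items()}


def encode_two_bits_per_base(bits: str) -> str:
    """Map each consecutive pair of bits to one DNA letter."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_two_bits_per_base(sequence: str) -> str:
    """Map each DNA letter back to its pair of bits."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


word = "00101001100111001111101000101101"  # the 32-bit example word used with FIG. 4
dna = encode_two_bits_per_base(word)
```

Under this assumed table the 32-bit word becomes a 16-letter sequence, and decoding returns the original word. Schemes that spend more than one base per bit trade this density for lower write and read error rates.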

Encoding schemes for use herein must have a cognate sensor, such as apolymerase-based molecular sensor, capable of distinguishing the signalsof the encoding features, so that the choices of BES are directlyrelated to the properties of the sensor in distinguishing features.Digital data formats or alphabets other than binary, such ashexadecimal, decimal, ASCII, etc., can equally well be encoded into DNAsignaling features by similar schemes as the BES of FIG. 4. Schemes moresophisticated than those shown, in terms of optimal information density,such as Lempel-Ziv encoding, can highly efficiently convert and compressdata from one alphabet into another. In general, for converting a binaryor other digital data payload string or collection of strings into a DNAsequence string, or collection of such strings, the methods of losslessand lossy encoding or compression can be used to devise schemes for theprimary conversion from input digital data payloads to DNA datapayloads.

In an exemplary embodiment, a polymerase-based molecular electronics sensor produced distinguishable signals in a monitored electrical parameter of the sensor when the sensor encountered the distinguishable signaling features of oligonucleotides 5′-CCCC-3′, 5′-GGGG-3′, and 5′-AAAA-3′, when bound to the respective reverse complement template segments F1=5′-GGGG-3′, F2=5′-CCCC-3′, and F3=5′-TTTT-3′, presented in a synthetic DNA molecule provided in a suitable buffer to the sensor. In this embodiment, a binary encoding scheme was used wherein the bit 0 was encoded as GGGG (i.e., F1), the bit 1 was encoded as CCCC (i.e., F2), and the binary string 00 was encoded as TTTT (i.e., F3). Note this encoding scheme included the encoding of 00 as TTTT, i.e., encoding as the data string 00 rather than as two consecutive data bits of 0, which would have encoded as GGGGGGGG. This encoding scheme was then used to encode an input binary data payload of “01001” into a nucleotide sequence for incorporation in the synthetic DNA molecule. The conversion to a feature sequence of F1-F2-F3-F1 began by dividing the input data string of binary data 01001 into the segments 0, 1, 00, and 1, and converting these data segments into a DNA data payload segment of the encoded DNA molecule as 5′-GGGGCCCCTTTTGGGG-3′. In other embodiments, there may be “punctuation” sequence segments inserted between the distinguishable signal features, which do not alter the distinguishable features, e.g., bound oligonucleotides, and which provide benefits such as accommodating special properties or constraints of the DNA synthesis chemistry, providing spacers for added time separation between signal features, reducing steric hindrance, or improving the structure of the DNA molecule. For example, if A were such a punctuation sequence, the DNA encoding sequence would become 5′-AGGGGACCCCATTTTAGGGGA-3′. In general, such insertion of punctuation sequences or filler sequences may be part of the process of translating from a digital data payload to the encoding DNA sequence to be synthesized.
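The punctuation step can be sketched as a simple join over the feature segments. The helper name below is hypothetical, and the placement rule (one punctuation segment before, between and after the features) is inferred from the single example sequence given in the text rather than stated as a general rule.

```python
from typing import Sequence


def assemble_payload(features: Sequence[str], punctuation: str = "") -> str:
    """Concatenate feature segments, optionally flanking each with a punctuation segment."""
    if not punctuation:
        return "".join(features)
    # Punctuation appears at both ends and between every pair of features.
    return punctuation + punctuation.join(features) + punctuation


# Feature sequence F1-F2-F3-F1 as given in the text, with F1=GGGG, F2=CCCC, F3=TTTT.
feature_segments = ["GGGG", "CCCC", "TTTT", "GGGG"]
plain = assemble_payload(feature_segments)
punctuated = assemble_payload(feature_segments, punctuation="A")
```

With "A" as the punctuation segment, this reproduces the 21-base sequence quoted in the example; a decoder would strip the same punctuation pattern before mapping features back to data.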

In various embodiments, information as binary data such as 010011100010 may be encoded using three states A, B, C, wherein 0 is encoded as A, 1 is encoded as B, and 00 is encoded as C whenever 00 occurs (i.e., so that 00 is never encoded as AA). In accordance with this scheme, the binary word 010011100010 is equivalent to the encoded form ABCBBBCABA.
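The three-state scheme amounts to a left-to-right scan that always prefers the two-bit match 00 over the one-bit match 0. The following sketch assumes exactly that greedy rule; the function names are illustrative.

```python
SYMBOL_TO_BITS = {"A": "0", "B": "1", "C": "00"}


def encode_three_state(bits: str) -> str:
    """Greedy left-to-right encoding: 00 -> C, otherwise 0 -> A and 1 -> B."""
    out = []
    i = 0
    while i < len(bits):
        if bits.startswith("00", i):
            out.append("C")
            i += 2
        elif bits[i] == "0":
            out.append("A")
            i += 1
        elif bits[i] == "1":
            out.append("B")
            i += 1
        else:
            raise ValueError("payload must contain only 0 and 1")
    return "".join(out)


def decode_three_state(symbols: str) -> str:
    """Expand each symbol back to the bits it stands for."""
    return "".join(SYMBOL_TO_BITS[s] for s in symbols)


encoded = encode_three_state("010011100010")
```

Because the scan consumes 00 whenever it appears, a run of three zeros is read as 00 followed by 0 (C then A), which is how the tail of the example word is handled. Decoding is unambiguous because each symbol expands to a fixed bit string.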

In general, for converting a binary or other digital data payload stringor collection of strings into a DNA sequence string or collection ofsuch strings, many of the methods of lossless and lossy encoding orcompression, e.g., those well known in computer science, can be used todevise schemes for the primary conversion of input digital data payloadsto DNA sequence data payloads, as strings of distinguishable feature DNAsegments, generalizing the examples of FIG. 4. In this broader context,the BES schemes exemplified in FIG. 4 illustrate the type of featureelements that could become symbols of an alphabet for data encoding,such as standard bases, modified bases, or sequence motifs or runs,provided that such elements have a cognate reader sensor.

FIG. 5 illustrates the relationships between physical DNA molecules and the digital data encoded therein. Physical DNA may exist in single-stranded or double-stranded form, depending, for example, on the details of the storage system. The sequence data payload, in the logical structure shown, represents the binary data, including error correction and addressing information, and it comprises only a portion of the sequence of the physical DNA molecule. There may be additional LEFT and RIGHT sequence segments relating to the physical handling of DNA, such as binding sites for specific complementary oligonucleotides, regions that carry other forms of commonly used binding groups useful for DNA manipulation, such as biotin sites, segments that provide spatial separation, or sequence segments used to calibrate properties of the reading or writing systems. Since the encoder/decoder process is generally performed on data and not on physical DNA, it typically does not itself require or prescribe any form of physical amplification of DNA.

With continued reference to FIG. 5, the DNA logical structure shown isan example structure of an information-carrying DNA fragment. In thisexample, a PRIMER segment contains primer target/structure. Further, anL-BUFFER segment may contain signal calibration sequence for the DNAreader, or buffering sequence prior to the DATA PAYLOAD segment,containing information storing encoded sequence and related errorcorrection sequence such as parity bits. R-BUFFER may contain additionalcalibration sequence, as well as buffer sequence allowing for the probemolecule (e.g., a polymerase) to avoid getting too close to the end ofthe template when reading DNA. L-ADAPTER and R-ADAPTER may be sequenceelements related to the storage or manipulation of the associated DNAsegment, such as adapters for outer priming sites for PCR amplification,or hybridization based selection, or representing a surrounding carrierDNA for this insert, including insertion into a host organism genome asa carrier. The data payload portion of the logical structure in generalmay include the actual primary data being archived as well as metadatafor the storage method, such as relating to the assembly of thisinformation into larger strings, or error detection and correction.
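One way to picture the logical structure is as a record whose fields are concatenated to give the physical sequence. The sketch below is an assumption-laden illustration: the left-to-right segment order (L-ADAPTER, PRIMER, L-BUFFER, DATA PAYLOAD, R-BUFFER, R-ADAPTER), the class name, and every example sequence are placeholders chosen here, not values from FIG. 5.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LogicalStrand:
    """Named segments of one information-carrying DNA fragment (order assumed)."""
    l_adapter: str
    primer: str
    l_buffer: str
    data_payload: str
    r_buffer: str
    r_adapter: str

    def sequence(self) -> str:
        """Full sequence to be synthesized, read left to right."""
        return (self.l_adapter + self.primer + self.l_buffer
                + self.data_payload + self.r_buffer + self.r_adapter)

    def payload_span(self) -> Tuple[int, int]:
        """Half-open [start, end) coordinates of the data payload within the full sequence."""
        start = len(self.l_adapter) + len(self.primer) + len(self.l_buffer)
        return start, start + len(self.data_payload)


# Placeholder segment sequences, for illustration only.
strand = LogicalStrand(
    l_adapter="ACGT",
    primer="TTGACC",
    l_buffer="AAAA",
    data_payload="GGGGCCCCTTTTGGGG",
    r_buffer="AAAA",
    r_adapter="TGCA",
)
```

Keeping the payload coordinates alongside the sequence lets a decoder discard the handling and calibration segments before translating the payload back to data.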

DNA Writing Device:

In various embodiments, a DNA writing device for use herein takes agiven set of input DNA sequence data and produces the DNA moleculeshaving these sequences. For each desired sequence, multiple DNAmolecules representing that sequence are produced. The multiplicity ofmolecules produced can be in the ranges of 10's, 100's, 1000's,millions, billions or trillions of copies of DNA molecules for eachdesired sequence. All of these copies representing all the desiredsequences may be pooled into one master pool of molecules. It is typicalof such DNA writing systems that the writing is not perfect, and if Nmolecules are synthesized to represent a given input sequence, not allof these will actually realize the desired sequence. For example, theymay contain erroneous deletions, insertions, or incorrect or physicallydamaged bases. Such a system will typically rely on some primary meansof synthesizing DNA molecules, such as comprising chemical reactions anda fluidic system for executing the processes on a large scale in termsof the number of distinct sequences being synthesized, (see, forexample, Kosuri and Church, “Large Scale de novo DNA Synthesis:Technologies and Applications,” Nature Methods, 11: 499-509, 2014).Non-limiting examples of methods and devices for synthesizing DNAmolecules include commercial technology offered by Agilent Technologiesand Twist BioScience.

FIG. 6 illustrates methods of synthesizing DNA molecules. In variousembodiments, primary methods of synthesizing DNA co-synthesize manyinstances of the same DNA sequence without using any amplificationprocedures. Such methods perform a number of physically or chemicallyisolated reactions, wherein each isolated region includes many molecularstart sites for the synthesis of each target sequence. In FIG. 6, themolecular start sites are indicated as “M” and are located, for example,on a solid support surface as shown. The distinct target sequences, “N,”are produced in distinct isolated reaction regions along the surface,reacting all the start sites within the same reaction, wherein thesereactions may be applied in time serially or in parallel for thedifferent sequence targets, such as depending on the system used. Thesecoupling reactions, e.g., from A to B to C in FIG. 6, add one or more ofthe desired bases to the molecules growing at the start sites of thatregion. This approach includes methods such as classical phosphoramiditeDNA synthesis, as well as methods that may rely on enzymatic addition ofbases, such as methods utilizing a terminal transferase enzyme toachieve base addition.

In various embodiments, nucleotides can be preferentially selected forincorporation in nucleotide sequences based on their ease of synthesisin the writing process that forms molecules, reduced propensity to formsecondary structure in the synthesized molecules, and/or ease in readingduring the data decoding process. In various aspects, bad writing motifsand bad reading motifs are avoided in the selection of nucleotides forincorporation into nucleotide sequences, with a focus on incorporatingsegments in the nucleotide sequence that will produce mutuallydistinguishable signals when that nucleotide sequence is read to decodethe encoded information. For example, in reading a nucleotide sequence,A and T are mutually distinguishable, C and G are mutuallydistinguishable, A, C and G are mutually distinguishable, AAA and TT aremutually distinguishable, A, GG and ATA are mutually distinguishable,and C, G, AAA, TTTT, and GTGTG are mutually distinguishable. These andmany other sets of nucleotide and nucleotide segments provide mutuallydistinguishable signals in a reader, and thus can be considered forincorporation in a nucleotide sequence when encoding a set ofinformation into a nucleotide sequence.

Additionally, there are nucleotide segments that are difficult to write, and thus should be avoided when encoding a set of information into a nucleotide sequence. In various embodiments, encoding of a set of information into a nucleotide sequence comprises the use of one of the remaining distinguishable feature sets as the encoding symbols, such as may correspond to binary 0/1, trinary 0/1/2 or quaternary 0/1/2/3 code, etc., along with an error correcting encoding to define the set of information in a way that avoids the hard to read and hard to write features. In this way, overall performance of an information storage system is improved.
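A candidate encoded sequence can be screened for hard-to-write or hard-to-read segments before it is sent to the writer. The sketch below assumes the avoided motifs are supplied as plain strings; the motif list shown is a placeholder, since the disclosure does not enumerate specific bad motifs.

```python
from typing import Iterable, List, Tuple


def find_avoided_motifs(sequence: str, avoided: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (motif, position) for every occurrence of an avoided motif, overlaps included."""
    hits = []
    for motif in avoided:
        if not motif:
            continue  # an empty motif would match everywhere
        for i in range(len(sequence) - len(motif) + 1):
            if sequence.startswith(motif, i):
                hits.append((motif, i))
    return hits


def is_acceptable(sequence: str, avoided: Iterable[str]) -> bool:
    """True when the sequence contains none of the avoided motifs."""
    return not find_avoided_motifs(sequence, avoided)


# Placeholder motif list, for illustration only.
HYPOTHETICAL_BAD_MOTIFS = ["GGGGGG", "ATATAT"]
hits = find_avoided_motifs("ACGGGGGGTA", HYPOTHETICAL_BAD_MOTIFS)
```

An encoder could use such a check to reject a candidate codeword and pick an alternative, which is one way an error-correcting layer can be constrained to the easy-to-write, easy-to-read feature set.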

DNA Reading Device:

In various aspects, the DNA reading device used herein is a device thattakes a pool of DNA molecules and produces a set of measured signals foreach of the molecules sampled or selected from this pool. Such signalsare then translated into a DNA sequence, or otherwise used tocharacterize the base patterns or motifs present in the DNA molecules.Current methods for reading data stored in DNA may rely on commercialnext-generation DNA sequencers for the primary recovery of sequencesfrom DNA samples. Such readers actually survey only a small portion ofthe DNA molecules introduced into the system, so that only a smallfraction will undergo an actual read attempt. Thus, amplification priorto DNA reading is common, given that most input DNA molecules are neveranalyzed, and are simply wasted. Furthermore, many methods of DNAsequencing have an amplification step as a fundamental part of theprocess that amplifies DNA onto a surface or a bead in preparation forsequencing, such as exemplified by the commercial Illumina HiSeq System(Illumina, Inc.), the 454 System (Roche, Inc.), or the Ion TorrentSystem (Thermo Fisher, Inc.), or as is done in classical Sangersequencing, such as using a thermocycling terminator sequencing reactionto produce sufficient input material required to meet the limits of thedetection process. Further, there may be amplification steps performedin nanopore sequencing. Other methods may use amplification to add tagsfor sequencing.
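The link between a reader that samples only a small fraction of its input and the pressure to supply many copies can be made concrete with a simple probability model. This is a back-of-the-envelope sketch under an assumption introduced here: each molecule is independently read with the same probability `read_fraction`. Real readers need not behave this way, and the numbers carry no experimental weight.

```python
import math


def prob_sequence_read(copies: int, read_fraction: float) -> float:
    """Probability that at least one of `copies` identical molecules is read."""
    return 1.0 - (1.0 - read_fraction) ** copies


def copies_needed(target_prob: float, read_fraction: float) -> int:
    """Smallest copy number giving at least `target_prob` chance of one read.

    Requires 0 < target_prob < 1 and 0 < read_fraction < 1.
    """
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - read_fraction))


single_copy = prob_sequence_read(1, 0.01)
many_copies = prob_sequence_read(1000, 0.01)
required = copies_needed(0.99, 0.01)
```

Under this model the required copy number falls as the read fraction rises, which is why a reader that analyzes most of its input molecules reduces the need for amplification, and why co-synthesized replicates can stand in for amplified copies.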

DNA sequencing methods may also separately rely on one or more rounds of amplification procedures during the sample preparation phase. Such methods have been used for the addition of adapter DNA segments to support subsequent processing. Also, some sequencing methods require at least that a single-stranded template have its complementary strand present for sequencing, such as the “Circular Consensus Sequencing” of Pacific BioSciences, Inc., or the “2D” hairpin sequencing of Oxford Nanopore Technologies, Inc. Such a method, if presented with a single-stranded template as input, requires a process with at least one round of extension, a form of amplification, to create the complementary strand before actual sequencing can begin. In addition, methods that require a relatively large amount of input template for the primary sequencing process, such as nanopore sequencing with inefficient pore loading, also may require amplification of input DNA to reach the required minimum input amount.

DNA Library Management:

In various embodiments, DNA library management comprises a collection ofoperational procedures and related methods and apparatus carried out onphysical DNA. Some such procedures relate to the mechanics of physicalstorage and retrieval of the DNA, such as drying down DNA from solution,re-suspending dried DNA into solution, and the transfer and storage ofthe physical quantities of DNA, such as into and out from freezers.Other procedures relate to the information storage management, such asmaking copies of data, deleting data, and selecting subsets of the data,all of which entail physical operations on the DNA material. Copying DNAor selecting DNA from a pool are generally performed using PCRamplification or linear amplification methods, thus common methods forlibrary management may rely on amplification of DNA. For example, for alibrary prepared with PCR primer sites in place, an entire archive canbe copied by taking a representative sample of the DNA and then PCRamplifying this up to the requisite amount for a copy. For furtherexample, for a library prepared with volume-specific PCR primers, avolume from the library can be selected by using PCR primers to amplifyup just the desired volume from a small DNA sample representing theentire library.

Motivations for Amplification in DNA Digital Data Storage:

DNA information storage systems envisioned from the above elements (DNA writer, DNA reader) may not work effectively at a specific single molecule level, and this motivates the use of amplification. That is, if there were a target DNA data storage sequence, such as GATTACA, it would not be feasible to make only a single physical molecule representing that sequence, then archive and handle the single molecule, and then read data from that single molecule. The infeasibility is due to the many sources of molecular structure error, loss of molecules, and limits of signal detection that exist in many such component processes. Thus, it is frequently proposed that there be amplification of the single molecule, at various stages in the process, to provide many more trial molecules to engage in all these processes. Thus, the goal achieved herein is to provide specific processes that individually or collectively remove the need for amplification steps in the various processes that comprise the DNA data storage system.

Benefits of Amplification-Free Methods and Apparatus in DNA DataStorage:

There are many benefits to amplification-free processes in a DNA information storage system. In general, amplification of DNA in the context of an information storage system will add cost, time, and operational complexity to the system, directly from the demands of the procedure. Amplification also typically amplifies some sequences more than others, and thus it may introduce representational bias into the data in the storage system that could result in loss of information or inaccurate information, or an increase in the time or cost to recover information. Amplification can also produce errors in the DNA sequences, as the enzymes involved can make errors during the copying process, or can create chimeric molecules that contain sequence parts of different template DNAs, or partial molecules that are not complete copies. Thus amplification can produce errors in the data, or spurious “noise” molecules. DNA amplification can also lead to contamination, as the large quantities of DNA generated during amplification can contaminate other non-amplified samples and result in a substantial fraction of the total DNA content in such samples coming from the source of contamination. Thus, amplification could produce a “corruption” of stored data. Amplification methods also typically require one, two, or more flanking primer sequences at the ends of the DNA molecules to support the priming and enzymatic extension processes used to achieve amplification. Such primer sequences, which are typically in the range of 6 to 30 bases in length, must be synthesized into the DNA molecules, and thus this increases the cost, complexity and potential for errors in the DNA writing processes.
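The representational bias mentioned above can be illustrated with the textbook exponential model of PCR, in which a sequence with per-cycle efficiency e grows by a factor of (1 + e) each cycle. The sketch below is illustrative only: the model, the sequence names and the efficiency values are assumptions made here, not measurements from the disclosure.

```python
from typing import Dict


def pool_fractions(start_counts: Dict[str, float],
                   efficiencies: Dict[str, float],
                   cycles: int) -> Dict[str, float]:
    """Fraction of the pool held by each sequence after exponential amplification."""
    final = {
        name: count * (1.0 + efficiencies[name]) ** cycles
        for name, count in start_counts.items()
    }
    total = sum(final.values())
    return {name: count / total for name, count in final.items()}


# Two hypothetical sequences that start at equal abundance.
start = {"seq_fast": 1000.0, "seq_slow": 1000.0}
unbiased = pool_fractions(start, {"seq_fast": 0.9, "seq_slow": 0.9}, cycles=20)
biased = pool_fractions(start, {"seq_fast": 0.95, "seq_slow": 0.85}, cycles=20)
```

With equal efficiencies the pool stays balanced, while a modest efficiency gap compounds over the cycles so that one sequence comes to dominate. A reader sampling such a pool would then need more reads to recover the under-represented sequence.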

Amplification also generally cannot reproduce DNA modifications, of which there are a great many known in nucleic acid chemistry and which are used in the methods described herein. Thus, the use of amplification at any point in the DNA data storage system greatly limits the ability to use this great diversity of modified DNA, which could otherwise be used to improve the performance of a DNA data storage system. It is thus a benefit of amplification-free systems that such systems enable such modified forms of DNA to be used as the information storage molecule. For example, modified DNA may comprise substituent groups on the DNA bases that increase signal to noise in a sensor when the DNA is read by the sensor, thereby greatly improving the power of the reading system. DNA modifications can be used to enhance the writing process, the stability of the resulting molecules, or the ability to manipulate and read data from the molecules. Use of modified DNA can provide data security or encryption, by having detectable modifications that are known only to trusted parties, or that only special reading systems could read. There are many types of modifications known to those skilled in such chemistry, which could potentially be used to enhance the capabilities of a DNA data storage system, such as modified bases, modifications to the DNA backbone, such as in Peptide Nucleic Acids (PNAs), or thiol-phosphate or iodo-phosphate modifications of the backbone, or other DNA analogs such as Locked Nucleic Acids (LNAs), or diverse Xeno Nucleic Acids (XNAs), or modifications to the sugar ring, or methylated bases, or labeled bases, or the addition of other chemical groups at various sites of the DNA molecule, such as biotin or other conjugation or binding groups, or groups that create stronger signals for the reader.

A DNA digital data storage system in accordance with the presentdisclosure benefits from having lower operational costs by beingentirely amplification-free or by comprising at least oneamplification-free subsystem. It is a further benefit of the presentsystem that the time it takes to store and/or recover information may bereduced. It is a further benefit that the system may have lowercomplexity, and consequently lower total ownership costs, lower risks offailure, or greater mean time between failures. It is a further benefitthat the representation biases inherent in amplification processes areavoided, so that in the writing, extracting or reading DNA, the diversesequences involved get more equal representation in these respectiveprocesses and in the overall system for storing and retrievinginformation. It is a further benefit to avoid the forms of introducederror or data corruption that are contributed by amplification. It is afurther benefit to avoid the need to synthesize amplification primersequences into the DNA molecules. It is a further benefit that potentialcontamination of other DNA data storage samples by amplificationproducts is eliminated, thereby increasing system integrity, robustness,efficiency and security. It is a further benefit that amplification-freeDNA information systems or subsystems therein remain compatible with theuse of modified DNA, which may comprise modifications to the bases,sugars or backbone of the DNA, and which may provide for more effectivereading systems (e.g., enhanced sensor signals) or more effectivewriting systems (e.g., more efficient synthesis chemistry), or which mayprovide more options for encoding information into DNA.

Methods of Avoiding Amplification in Writing:

In various embodiments, synthetic methods for writing DNA are providedthat co-synthesize many instances of physical DNA molecules for eachdesired sequence. FIG. 6 shows a synthesis process intended to produceDNA molecules representing N target sequences from M distinct startsites for each target sequence. The process then comprises cyclicalsynthesis reactions that successively add one base or base segments inparallel to these M sites via regionally or chemically isolated andlocalized bulk reactions. Ideally, M DNA molecules are produced, allrepresenting the target DNA sequence. For such cyclical, successive baseaddition processes, co-synthesized replicates of each target DNAsequence are naturally produced, and thus they can provide for anamplification-free process for writing up to M physical instances ofeach target DNA sequence. Such methods can be part of anamplification-free DNA storage system. Examples of such methods of DNAsynthesis include ink-jet printer synthesis processes for printing ofDNA oligonucleotides, such as has been done commercially by TwistBioscience, Inc. and by Agilent, Inc., light-directed parallel synthesisof DNA oligonucleotides, such as has been done commercially byAffymetrix, Inc. and Nimblegen, Inc., as well as many approaches tooperating a large number of small-volume or micro-fluidic classicalphosphoramidite DNA synthesis reactions, such as has been done byApplied Biosystems, Inc. in their 3900 DNA Synthesizer (48 parallelsynthesis columns). Thus, by directly co-synthesizing a greater numberof molecules of each target sequence within such a synthesis process,amplification of these synthesis products post-synthesis to produce adesired number of copies is avoided.
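The statement that "ideally" all M start sites yield the target sequence can be tempered with a simple yield estimate. The sketch below assumes that each base addition succeeds independently with the same probability and that any failure spoils the molecule; this is a simplified model introduced here for illustration, and the efficiency value is a placeholder.

```python
def expected_correct_molecules(start_sites: int,
                               per_base_success: float,
                               length: int) -> float:
    """Expected number of full, correct molecules from `start_sites` parallel syntheses.

    Assumes each of `length` base additions succeeds independently with
    probability `per_base_success` (a simplified model, not from the disclosure).
    """
    if not 0.0 <= per_base_success <= 1.0:
        raise ValueError("per_base_success must be between 0 and 1")
    return start_sites * per_base_success ** length


ideal = expected_correct_molecules(1_000_000, 1.0, 100)
realistic = expected_correct_molecules(1_000_000, 0.99, 100)
```

Even with a sizeable loss per molecule, starting from a large M leaves many correct replicates of each target sequence, which is what lets co-synthesis take the place of post-synthesis amplification.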

Methods of Avoiding Amplification in Reading:

When reading DNA using present sequencing technologies, there is often a requirement to amplify the input DNA. In some methods, this occurs because the method requires larger input amounts, and therefore requires a grossly larger quantity of DNA than would be typically available from various sample sources. In other methods, the creation of the sequencing library includes amplification steps. In yet other methods, commonly known as clonal sequencing methods, many copies of the molecule to be sequenced must be produced directly and localized on a support as an integral part of the sequencing process, such as the DNA clusters used in the Solexa/HiSeq instruments (Illumina, Inc.), the DNA “SNAP or ISP” beads used in the Ion Torrent instruments (Thermo Fisher, Inc.), or the DNA beads required by the ABI SOLiD instruments (Life Technologies, Inc.), or the DNA beads used in the 454 instruments (Roche, Inc.). Such amplification requirements can be eliminated by using a suitable single-molecule sequencing method. In such methods, a single DNA molecule is analyzed by a sensor to produce the fundamental sequence read or data extraction from the molecule. Such methods in principle do not require amplifying the DNA molecules prior to analysis by the sensor. Thus, utilizing a suitable single molecule DNA reading sensor provides the means to achieve amplification-free reading. Such single molecule methods of DNA sequencing are illustrated in FIGS. 24 and 25.

Also of note are methods that use a carbon nanotube having a polymerase attached thereon to produce electrical signals, as in the sensor illustrated in FIG. 25. Note, however, that while in principle such single-molecule sequencing systems could provide for amplification-free reading of DNA, as used commercially, such systems often require amplification in their established protocols. For example, the Oxford Nanopore MinION sequencer requires microgram quantities of input DNA in the standard protocol, which would require some form of amplification to achieve in the context of DNA data storage. Also, the preferred sequencing mode of the Oxford system, “2D” sequencing, requires a double-stranded molecule because it reads both strands to produce a more accurate consensus sequence after joining them with a hairpin adapter in the sample preparation phase. This typically would require an enzymatic amplification step to synthesize the complementary DNA strand for the primary single-stranded synthetic storage molecule. Other sequencers may require double-stranded input DNA, and thus would typically require an amplification step to create the complementary strand from the primary single-stranded synthetic storage molecule.

An embodiment for a single molecule DNA reader not requiring any amplification of the DNA to be read is a molecular electronics sensor deployed on a CMOS sensor pixel array chip, as illustrated in FIGS. 7-23 and 25, and further described below.

Methods of Avoiding Amplification in DNA Archive Management:

The major operations a DNA data storage archive management system may comprise are considered below.

1. DNA Storage Archive Operations

For a given archive, it may be desirable to perform the following operations:

-   Create a copy of the archive;
-   Append data to the archive;
-   Read out a targeted volume from the archive;
-   Delete a volume from the archive; and
-   Search the archive.

In various embodiments, a DNA archive in accordance with the present disclosure exists in its primary physical state as a pool (i.e., a mixture) of DNA molecules, with each desired DNA sequence represented by a number of molecular exemplars. This pool of DNA molecules could be stored in a dry state, or in solution phase. In any case, the archive can be temporarily brought up to working temperatures in a compatible buffer solution to perform these operations. These operations would be performed efficiently by the physical storage system, which may include automation for handling of tubes, liquid handling, performing biochemical reactions, and the other procedures related to maintaining and manipulating the physical archive material.

These storage-related operations can be achieved without amplification, in contrast to doing these operations as they would commonly be done with amplification, as follows:

2. Copying

Copying an archive may be performed without any amplification by simply taking an aliquot of the stock solution. This provides a functional copy as long as a sufficient amount is taken to support future retrieval of information, and also to perhaps support limited numbers of further archive operations, such as further copying. For contrast, copying would more commonly be done via amplification, such as by including in the encoding DNA molecules amplification primer sites, and thus a small sample from the stock can be taken, followed by priming and amplifying, in linear or exponential amplification reactions, to obtain a substantial amount of material representing a functional copy of the original archive. Thus, a beneficial way to avoid such amplification processes for copying an archive is provided.
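The requirement that "a sufficient amount is taken" can be made concrete with a simple sampling estimate. The following is a minimal sketch, not part of the disclosed method, and it assumes a well-mixed pool in which every molecule lands in the aliquot independently with probability equal to the aliquot fraction. The function names and parameters are hypothetical.

```python
def dropout_probability(copies_in_stock: int, aliquot_fraction: float) -> float:
    """Probability that no copy of one sequence ends up in the aliquot.

    Assumption (not from the source): each of the copies is captured
    independently with probability `aliquot_fraction`.
    """
    if copies_in_stock < 0:
        raise ValueError("copies_in_stock must be non-negative")
    if not 0.0 <= aliquot_fraction <= 1.0:
        raise ValueError("aliquot_fraction must lie in [0, 1]")
    return (1.0 - aliquot_fraction) ** copies_in_stock


def expected_missing_sequences(num_sequences: int,
                               copies_in_stock: int,
                               aliquot_fraction: float) -> float:
    """Expected number of distinct sequences absent from the aliquot,
    assuming every sequence has the same number of copies in the stock."""
    return num_sequences * dropout_probability(copies_in_stock, aliquot_fraction)
```

Under these assumptions, the expected number of lost sequences falls off exponentially with the number of copies per sequence, which is why co-synthesizing many replicates supports repeated aliquot-based copying.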

Copying of an archive may also be performed without amplification by using an amplification-free DNA reading system to read all the information from the archive, and then using an amplification-free DNA writing system to write all of the information into a new DNA archive, thereby achieving a DNA data copy of the original DNA data archive.

3. Appending

Appending data to the archive or merging archives can be achieved simply by pooling in and mixing with the additional DNA or archive material. This does not require amplification.

4. Targeted Reading

Working with individual “volumes” within an archive can be performed in an amplification-free manner by encoding into the DNA molecules sequence-specific oligonucleotide binding sites, with a different identifier/binding sequence for each volume to be made so accessible. Then, to read out a specific volume, hybridization-based capture could be used to select out specific DNA fragments with desired binding sequences. This process can be amplification free. Volume identifiers could also be added by synthesizing DNA with nucleotide modifications, so the relevant binding targets are not via DNA-sequence specific hybridization per se, but in other modifications on the bases used in the synthesis. For example, use of biotinylated bases, or bases with various hapten modifications, PNAs, non-classical DNA bases, or segments that carry epitope targets for antibody binding, or the use of PNA primer sites for improved binding affinity, all similarly provide selective ability to bind or manipulate subsets of the DNA via the corresponding interaction partners for these modifications intentionally introduced in the synthesis. These amplification-free targeting methods are all in contrast to targeted reading that relies on PCR-like processes to amplify out the target volume of interest.

Another embodiment of amplification-free targeted reading is the process whereby the archive, or a representative sample of the archive, is first presented to an amplification-free reader, which obtains reads sampling from the information content of the entire archive. Presuming the reads have a volume identifier in them, the read data from the desired volume are selected informatically from all such read data, thereby achieving the targeted reading through informatics selection. Another such embodiment relies on a reader that can in real-time read the volume identifier on a fragment, and either halts, or rejects and acquires another DNA fragment, if the identifier is not in the target volume, but otherwise completes the read if it detects the target volume identifier, achieving targeted reading. This is a dynamic informatics selection, which has the benefit of reading less unneeded information in the course of reading the targeted volume. Readers that can provide this capability include the molecular electronics sensor described in detail below, as well as certain embodiments of nanopore sequencing sensors.
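The two informatics selections described above can be sketched in a few lines. This is an illustrative sketch only. It assumes, hypothetically, that the volume identifier is a fixed-length prefix at the start of every read, which the disclosure does not specify. The function names are likewise hypothetical.

```python
from typing import Iterable, List, Optional


def select_volume(reads: Iterable[str], volume_id: str) -> List[str]:
    """Static selection: read everything, then keep only the payloads of
    reads whose (assumed) prefix matches the target volume identifier."""
    return [read[len(volume_id):] for read in reads if read.startswith(volume_id)]


def dynamic_read(fragment: str, volume_id: str) -> Optional[str]:
    """Dynamic selection: inspect only the identifier region first and
    abandon the fragment (return None) if it is not in the target volume.

    Here the fragment is a full string for simplicity; a real-time reader
    would stop acquiring signal once the identifier fails to match.
    """
    identifier = fragment[:len(volume_id)]
    if identifier != volume_id:
        return None  # reject and move on to another fragment
    return fragment[len(volume_id):]  # complete the read
```

The static version spends reading effort on every fragment, while the dynamic version limits the wasted effort on off-target fragments to the identifier region.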

5. Searching

Search of an archive for a literal input string can be achieved by encoding the search string or strings of interest into DNA form, synthesizing a complementary form or related primers for the desired DNA sequences, and using hybridization to extract these desired sequence fragments from the archive. The hits can be identified by quantifying the amount of DNA recovered, or by using the DNA data reader to survey the recovered material. The search could report either presence or absence, or could recover the associated fragments containing the search string for complete reading. In contrast, searching methods that rely on PCR-like processes to amplify out the search target or to capture such targets and then amplify the results are to be avoided.
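The search procedure can be illustrated in software as follows. This is a hedged sketch: the two-bits-per-base mapping (00→A, 01→C, 10→G, 11→T) is an assumed encoding chosen only for illustration, hybridization capture is idealized as exact substring matching, and all function names are hypothetical.

```python
from typing import List

BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encode_text(text: str) -> str:
    """Encode a literal string into DNA letters, two bits per base."""
    letters = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):
            letters.append(BASES[(byte >> shift) & 0b11])
    return "".join(letters)


def capture_probe(text: str) -> str:
    """Reverse complement of the encoded search string, i.e. the strand
    that would hybridize to fragments containing the encoded string."""
    return encode_text(text).translate(COMPLEMENT)[::-1]


def search(archive: List[str], text: str) -> List[str]:
    """Idealized capture: return fragments containing the encoded string.
    Real hybridization tolerates mismatches and depends on conditions."""
    target = encode_text(text)
    return [fragment for fragment in archive if target in fragment]
```

A presence or absence report corresponds to checking whether the returned list is empty; full recovery corresponds to reading the returned fragments.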

Embodiments of Amplification-Free Reading:

In various embodiments, amplification-free DNA reading herein comprises an all-electronic measurement of a single DNA molecule as it is processed by a polymerase or other probe molecule integrated into an electrical circuit that monitors an electrical parameter of the sensor circuit, such as the current. In an embodiment for DNA data storage reading, these sensors are deployed on a CMOS sensor array chip, with a large pixel array that provides the current measurement circuitry. Such a sensor chip may have millions of sensors, each processing successive DNA molecules, so that the required amount of input DNA may be as low as millions of total molecules. This provides for highly scalable, fast reading of DNA molecules without the need to pre-amplify the DNA, or amplify a specific target DNA molecule as part of the reading process.

In various embodiments of the DNA information storage system herein, the DNA reading device comprises a massively parallel DNA sequencing device, which is capable of high speed reading of bases from each specific DNA molecule such that the overall rate of reading stored DNA information can be fast enough, and at high enough volume, for practical use in large scale archival information retrieval. The rate of reading bases sets a minimum time on data retrieval, related to the length of stored DNA molecules.

FIG. 7 shows an embodiment of a molecular electronic sensing circuit for a molecular electronics sensor capable of amplification-free DNA reading in which a molecule completes an electrical circuit and an electrical circuit parameter is measured versus time to provide a signal, wherein variations in signal reflect interactions of the molecule with other molecules in the environment. As illustrated in FIG. 7, a molecular electronics sensor circuit 1 comprises a circuit in which a single probe molecule 2 (or alternatively, a sensor complex comprising two or a small number of molecules) forms a completed electrical circuit by spanning the electrode gap 9 between a pair of spaced-apart nano-scale electrodes 3 and 4. Electrodes 3 and 4 may be positive and negative electrodes, or source and drain electrodes, and in this case are disposed on a support layer 5. The sensor molecule may be electrically conjugated in place to each of the electrodes by specific attachment points 6 and 7. In certain aspects, an electronic parameter 100 of the circuit is measured as the sensor molecule 2 interacts with various interacting molecules 8 to provide signals 101 in the measured electronic parameter. The measured parameter 100 may comprise current (i) passing between the electrodes and through the sensor molecule 2 versus time, with the electrical signals 101 in the measured parameter indicative of molecular interactions between the interacting molecules 8 and the sensor molecule 2, as illustrated by the plot of (i) versus (t) in FIG. 7.

FIG. 8 illustrates an embodiment of a polymerase-based molecular electronics sensor for use herein as a DNA reader device. A sensor, such as illustrated in FIG. 8, and comprising a polymerase or other processive enzyme, produces distinguishable signals in the monitored current over time when the polymerase encounters distinct molecular features on the DNA (abbreviated in FIG. 8 as “FEAT 1”, “FEAT 2”, and so forth). Such distinct molecular features can be used in planned arrangements to encode information into synthetic DNA molecules, which can in turn be read via the sensor knowing that distinguishable signals will be seen in the monitored electrical parameter. In various embodiments, the molecular complex of an individual sensor circuit comprises a single polymerase enzyme molecule that engages with a target DNA molecule to produce electrical signals as it processes the DNA template. Under appropriate conditions, such a polymerase will produce distinguishable electrical signal features, corresponding to specific distinct features of a template DNA molecule, such as illustrated in FIG. 8 by two different peak shapes/amplitudes in the signal trace. Such distinguishable signal features can therefore be used to encode information in synthetic DNA molecules, through a great variety of encoding schemes, such as those of FIG. 4, discussed above, and therefore such a sensor provides the reader for encoded data.

FIG. 9 illustrates an embodiment where the polymerase molecule is conjugated to a bridge molecule conjugated at its ends to source and drain electrodes to span the gap between the electrode pair. That is, the sensor comprises a molecular sensor complex further comprising the polymerase and the bridge molecule. The sensor may further comprise a buried gate electrode as illustrated, which can be used to further modulate the sensitivity of the circuit. Current between the electrodes is the measured electrical parameter. When the polymerase engages a proper template, such as a primed, single-stranded DNA molecule, in the presence of suitable buffer solution and dNTPs as shown, the activity of the polymerase in synthesizing a complementary strand causes perturbations in the measured signals related to the detailed kinetics of the enzyme activity. In this case, the plot of current through the electrodes versus time provides a signal with distinguishable features (such as amplitude variations) corresponding to structural features of the DNA molecule being processed. In this embodiment, two different sequence motifs AA and CCC are used in an encoded DNA molecule to provide two distinguishable signals when encountered by the polymerase. In this way, the motifs AA and CCC provide a means to encode binary bits 0/1 into the template DNA, such as by using AA for the binary bit 0 and using CCC for the binary bit 1. Importantly, useful encoding and reading of information is possible even without single base resolution of DNA sequences, by instead relying on distinguishable sequence motifs, which is a clear advantage over using a single bit/single DNA base BES, which necessitates single base resolution in reading.
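The AA/CCC motif scheme described for FIG. 9 can be sketched as a pair of encode and decode functions. This is an illustrative sketch that works on sequence strings rather than sensor signals, and it omits any primer site, spacer or error correction that a practical template might need. The function names are hypothetical.

```python
import re

MOTIF_FOR_BIT = {"0": "AA", "1": "CCC"}  # mapping stated for FIG. 9
BIT_FOR_MOTIF = {motif: bit for bit, motif in MOTIF_FOR_BIT.items()}


def encode_bits(bits: str) -> str:
    """Concatenate one motif per bit, e.g. '01' -> 'AACCC'."""
    return "".join(MOTIF_FOR_BIT[bit] for bit in bits)


def decode_motifs(template: str) -> str:
    """Recover the bit string from a motif-encoded template.

    Raises ValueError if the template is not an exact concatenation
    of AA and CCC motifs.
    """
    motifs = re.findall(r"AA|CCC", template)
    if "".join(motifs) != template:
        raise ValueError("template is not a clean AA/CCC motif sequence")
    return "".join(BIT_FOR_MOTIF[motif] for motif in motifs)
```

Because A-runs always have even length and C-runs always have a length divisible by three, the string is uniquely decodable. A physical reader would instead count or time distinguishable signal features, which is a separate problem not modeled here.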

FIG. 10 shows the detailed protein anatomy and DNA engagement of one of the exemplary polymerase enzymes for use as the probe molecule of a molecular sensor herein, specifically the E. coli Klenow fragment. The structure shown is PDB ID 1KLN. The detailed structure and how it engages the template DNA inform the choice of how to best conjugate the enzyme into the circuit, so as not to interfere with its interaction with DNA, and to position the signaling portions of the protein or DNA near to the molecular bridge for enhanced signal generation via proximity. The helix, sheet and loop portions of the enzyme are pointed out in this conformation wherein the DNA is engaged with the enzyme.

FIG. 11 shows embodiments of a molecular sensor 130A, usable as a DNA reading device, wherein the polymerase enzyme 134A is conjugated to a molecular bridge molecule 133A, at a conjugation point 135A representing a bond between a specific site on the enzyme 134A and a specific site on the bridge molecule 133A. As shown, the bridge molecule 133A is bonded to each of the spaced-apart electrodes to span the electrode gap 139A. The bridge molecule 133A comprises first and second ends functionalized to bond to each of the electrodes in the pair of electrodes at conjugation points 131A and 132A.

FIG. 12 shows the molecular structure of one specific embodiment of a polymerase-based molecular sensor 130B for reading DNA in a DNA information storage and retrieval system, wherein the polymerase 134B is conjugated to a bridge molecule 133B comprising a 20 nm long (=6 helical turns) double-stranded DNA. The sensor 130B further comprises a pair of spaced apart chromium electrodes 138B and 139B, disposed on a substrate layer, such as SiO₂, and spaced apart by about 10 nm. On each electrode 138B and 139B are deposits of gold 131B that participate in the bonding of the bridge molecule to each of the electrodes. The DNA bridge molecule 133B shown is conjugated to the gold-on-chromium electrodes through thiol groups on first and second ends of the DNA bridge, binding to gold via sulfur-gold bonds 132B, and wherein the polymerase 134B is conjugated to the DNA bridge molecule 133B at a centrally located biotinylated base on the DNA bridge 135B, bound to a streptavidin molecule 136B, in turn bound to the polymerase 134B via a specific biotinylated site 135B on the polymerase 134B. In this way, the streptavidin 136B links the polymerase 134B to the DNA bridge molecule 133B by way of two biotin-streptavidin linkages 135B. The processive enzyme molecular sensor 130B is illustrated translocating a DNA substrate molecule 137B. As discussed, the DNA substrate molecule 137B may be encoded with information comprising arrangements of signaling features such as bound DNA oligonucleotide segments or perturbing groups.

In various embodiments of a molecular electronics sensor for use herein, the polymerase may be a native or mutant form of Klenow, Taq, Bst, Phi29 or T7, or may be a reverse transcriptase. In various embodiments, the mutated polymerase forms will enable site specific conjugation of the polymerase to the bridge molecule, arm molecule or electrodes, through introduction of specific conjugation sites in the polymerase. Such conjugation sites engineered into the protein by recombinant methods or methods of synthetic biology may, in various embodiments, comprise a cysteine, an aldehyde tag site (e.g., the peptide motif CxPxR), a tetracysteine motif (e.g., the peptide motif CCPGCC), or an unnatural or non-standard amino acid (NSAA) site, such as through the use of an expanded genetic code to introduce a p-acetylphenylalanine, or an unnatural cross-linkable amino acid, such as through use of RNA- or DNA-protein cross-link using 5-bromouridine.

In various embodiments, the bridge molecule may comprise double-stranded DNA, other DNA duplex structures, such as DNA-PNA or DNA-LNA or DNA-RNA duplex hybrids, peptides, protein alpha-helix structures, antibodies or antibody Fab domains, graphene nanoribbons or carbon nanotubes, or any other of a wide array of molecular wires or conducting molecules known to those skilled in the art of molecular electronics. The conjugations of polymerase to such molecules, or of such molecules to the electrodes, may be by a diverse array of conjugation methods known to those skilled in the art of conjugation chemistry, such as biotin-avidin couplings, thiol-gold couplings, cysteine-maleimide couplings, gold or material binding peptides, click chemistry coupling, Spy-SpyCatcher protein interaction coupling, antibody-antigen binding (such as the FLAG peptide tag/anti-FLAG antibody system), and the like. Coupling to electrodes may be through material binding peptides, or through the use of a SAM (Self-Assembling-Monolayer) or other surface derivatization on the electrode surface to present suitable functional groups for conjugation, such as azide or amine groups. The electrodes comprise electrically conducting structures, which may comprise any metal, such as gold, silver, platinum, palladium, aluminum, chromium, or titanium, layers of such metals in any combination, such as gold on chromium, or semiconductors, such as doped silicon, or in other embodiments, a contact point of a first material on a support comprising a second material, such that the contact point is a site that directs chemical self-assembly of the molecular complex to the electrode.

In various embodiments, electrical parameters measured in a sensor, such as the sensor illustrated in FIG. 12, can in general be any electrical property of the sensor circuit measurable while the sensor is active. In one embodiment, the parameter is the current passing between the electrodes versus time, either continuously or sampled at discrete times, when a voltage, fixed or varying, is applied between the electrodes. In various embodiments, a gate electrode is capacitively coupled to the molecular structure, such as a buried gate or back gate, which applies a gate voltage, fixed or variable, during the measurement. In various other embodiments the measured parameter may be the resistance, conductance, or impedance between the two electrodes, measured continuously versus time or sampled periodically. In various aspects, the measured parameter comprises the voltage between the electrodes. If there is a gate electrode, the measured parameter can be the gate voltage.

In various embodiments, the measured parameter in a molecular electronics sensor, such as the sensor of FIG. 12, may comprise a capacitance, or the amount of charge or voltage accumulated on a capacitor coupled to the circuit. The measurement can be a voltage spectroscopy measurement, such that the measurement process comprises capturing an I-V or C-V curve. The measurement can be a frequency response measurement. In all such measurements, for all such measured parameters, there are embodiments in which a gate electrode applies a gate voltage, fixed or variable, near the molecular complex during the measurement. Such a gate will typically be physically located within a micron distance, and in various embodiments, within a 200 nm distance of the molecular complex. For the electrical measurements, in some embodiments there will be a reference electrode present, such as an Ag/AgCl reference electrode, or a platinum electrode, in the solution in contact with the sensor, and maintained at an external potential, such as ground, to maintain the solution at a stable or observed potential, and thereby make the electrical measurements better defined or controlled. In addition, when making the electrical parameter measurement, various other electrical parameters may be held fixed at prescribed values, or varied in a prescribed pattern, such as, for example, the source-drain electrode voltage, the gate voltage if there is a gate electrode, or the source-drain current.

The use of a sensor, such as the sensor illustrated in FIG. 12, to measure distinguishable features of a DNA molecule requires the polymerase to be maintained in appropriate physical and chemical conditions for the polymerase to be active, to process DNA templates, and to produce strong, distinguishable signals above any background noise (i.e., high signal-to-noise ratio, or “SNR”). To achieve this, the polymerase may reside in an aqueous buffer solution. In various embodiments, a buffer solution may comprise any combination of salts, e.g., NaCl or KCl, pH buffers, e.g., Tris-HCl, multivalent cation cofactors, e.g., Mg, Mn, Ca, Co, Zn, Ni, Fe or Cu, or other ions, surfactants, such as Tween, chelating agents such as EDTA, reducing agents such as DTT or TCEP, solvents, such as betaine or DMSO, volume concentrating agents, such as PEG, and any other component typical of the buffers used for polymerase enzymes in molecular biology applications and known to those skilled in the field of molecular biology. The sensor signals may also be enhanced by such buffers being maintained in a certain range of pH or temperature, or at a certain ionic strength. In various embodiments, the ionic strength may be selected to obtain a Debye length (electrical charge screening distance) in the solution favorable for electrical signal production, which may be, for example, in the range of from about 0.3 nm to about 100 nm, and in certain embodiments, in the range of from about 1 nm to about 10 nm. Such buffers formulated to have larger Debye lengths may be more dilute or have lower ionic strength by a factor of 10, 100, 1000, 100,000 or 1 million relative to the buffer concentrations routinely used in standard molecular biology procedures such as PCR.
Buffer compositions, concentrations and conditions (pH, temperature, or ionic strength, for example) may also be selected or optimized to alter the enzyme kinetics to favorably increase the signal-to-noise ratio (SNR) of the sensor, the overall rate of signal production, or overall rate of information production, in the context of reading data stored in DNA molecules. This may include slowing down or speeding up the polymerase activity by these methods, or altering the fidelity or accuracy of the polymerase. This optimal buffer selection process consists of selecting trial conditions from the matrix of all such parameter variations, empirically measuring a figure of merit, such as related to the discrimination of the distinguishable features, or to the overall speed of feature discrimination when processing a template, and using various search strategies, such as those applied in statistical Design Of Experiment (DOE) methods, to infer optimal parameter combinations.
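The relationship between ionic strength and Debye length referred to above can be estimated with a standard textbook approximation for aqueous solutions near 25 °C, in which the Debye length in nanometers is about 0.304 divided by the square root of the ionic strength in mol/L. The sketch below uses that approximation. It is not taken from the disclosure, and the function names are hypothetical.

```python
import math


def debye_length_nm(ionic_strength_molar: float) -> float:
    """Approximate Debye length in water near 25 degrees C.

    Uses the common approximation lambda_D ~ 0.304 / sqrt(I) nm,
    with I the ionic strength in mol/L (an assumption of this sketch).
    """
    if ionic_strength_molar <= 0:
        raise ValueError("ionic strength must be positive")
    return 0.304 / math.sqrt(ionic_strength_molar)


def dilution_for_target(start_molar: float, target_nm: float) -> float:
    """Fold dilution of a buffer needed to reach a target Debye length.

    Because the length scales as 1/sqrt(I), a k-fold longer Debye
    length requires a k**2-fold lower ionic strength.
    """
    return (target_nm / debye_length_nm(start_molar)) ** 2
```

Under this approximation, a 0.1 M buffer has a Debye length of roughly 1 nm, and a 100-fold dilution to 1 mM raises it to roughly 10 nm, which spans the 1 nm to 10 nm range mentioned above.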

The use of a sensor such as the sensor of FIG. 12 to measure distinguishable features of a DNA molecule requires the polymerase be provided with a supply of dNTPs so that the polymerase can act processively on a template single-stranded DNA molecule to synthesize a complementary strand. The standard or native dNTPs are dATP, dCTP, dGTP, and dTTP, which provide the A, C, G, and T base monomers for polymerization into a DNA strand, in the form required for the enzyme to act on them as substrates. Polymerase enzymes, native or mutant, may also accept analogues of these natural dNTPs, or modified forms, that may enhance or enable the generation of the distinguishable signals.

In various aspects of DNA reading herein, if a system reads a DNA molecule at a speed of 1 base per 10 minutes, as is representative of current next generation, optical dye-labeled terminator sequencers, then reading a 300 base DNA molecule takes at least 3,000 minutes (50 hours), aside from any time required to prepare the sample for reading. Such relatively slower systems therefore favor storing information in a larger number of shorter reads, such as 30 base reads that could be read in 5 hours. However, this requires a larger number of total reads, so the system must support billions or more such reads, as is the case on such sequencers. The current generation of optical massively parallel sequencers read on the order of 3 billion letters of DNA per 6-minute cycle, or roughly the equivalent of 1 billion bits per minute, or 2 MB per second, although for data stored as 100 base DNA words, this would also require 600 minutes (5 hours). This can be seen to be a relatively low rate of data reading, although within a practical realm, as a typical book may contain 1 MB of textual data. The overall rate is practical, but the slow per base time makes this highly inefficient for reading a single book of data, and ideally matched to bulk reading of 36,000 books in parallel, over 5 hours. Thus, there is also a lack of scalability in this current capability, and also a high capital cost of the reading device (optical DNA sequencers cost in the $100,000 to $1,000,000 range presently). More critically, on such current systems, the cost of sequencing a human genome worth of DNA, 100 billion bases, is roughly $1,000, which means the cost of reading information is $1,000 per 200 Giga-bits, or $40 per GB. This is radically higher than the cost of reading information from magnetic tape storage or CDs, which is on the order of $1 per 10,000 GB, or $0.0001 per GB, 400,000 fold less costly.
Thus the cost of reading DNA should be reduced by several orders of magnitude, even by 1,000,000 fold, to make this attractive for large scale, long term archival storage, not considering other advantages. Such improvements may indeed be possible, as evidenced by the million-fold reduction in costs of sequencing that has already occurred since the first commercial sequencers were produced.
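Several of the figures in the preceding paragraph follow from simple arithmetic, which can be reproduced as below. The sketch assumes 2 bits per base, which is implied by the stated equivalence of 100 billion bases to 200 Giga-bits, and 8 bits per byte. The function names are hypothetical.

```python
BITS_PER_BASE = 2  # implied by "100 billion bases ... 200 Giga-bits"
BITS_PER_BYTE = 8


def read_time_minutes(bases_per_read: int, minutes_per_base: float) -> float:
    """Minimum time to read one molecule serially, base by base."""
    return bases_per_read * minutes_per_base


def throughput_mb_per_second(bases_per_cycle: float, cycle_minutes: float) -> float:
    """Aggregate data rate of a massively parallel reader, in MB/s."""
    bytes_per_cycle = bases_per_cycle * BITS_PER_BASE / BITS_PER_BYTE
    return bytes_per_cycle / 1e6 / (cycle_minutes * 60)


def cost_per_gigabyte(run_cost_usd: float, bases_read: float) -> float:
    """Reading cost in dollars per GB of stored data."""
    gigabytes = bases_read * BITS_PER_BASE / BITS_PER_BYTE / 1e9
    return run_cost_usd / gigabytes


dna_cost = cost_per_gigabyte(1_000, 100e9)   # $1,000 per 100 billion bases
tape_cost = 1 / 10_000                       # $1 per 10,000 GB
fold_gap = dna_cost / tape_cost              # DNA vs. tape reading cost
```

With these inputs, the 300 base and 30 base read times come to 3,000 and 300 minutes, the parallel throughput to about 2 MB per second, the DNA reading cost to $40 per GB, and the gap relative to tape to about 400,000 fold, matching the values stated above.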

In various embodiments, the DNA reader of the present system provides substantially lower instrument capital costs, higher per-base reading speed, and greater scalability in total number of reads per run, compared to currently available optical next generation sequencing instruments. In various aspects, the reading device for use herein is based on a CMOS chip sensor array device in order to increase the speed and scalability and decrease the capital costs. An embodiment of such a device comprises a CMOS sensor array device, wherein each sensor pixel contains a molecular electronic sensor capable of reading a single molecule of DNA without any molecular amplification or copying, such as PCR, required. In various embodiments, the CMOS chip comprises a scalable pixel array, with each pixel containing a molecular electronic sensor, and such a sensor comprising a bridge molecule and polymerase enzyme, configured so as to produce sequence-related modulations of the electrical current (or related electrical parameters such as voltage, conductance, etc.) as the enzyme processes the DNA template molecule.

An exemplary molecular sensor and chip combination usable as a DNA reader device in the present DNA data storage system is depicted in FIGS. 12, 13, and 18-23. As discussed, FIG. 12 illustrates an exemplary molecular sensor comprising a bridge and probe molecular sensor complex further comprising a bridge of double-stranded DNA having about a 20 nm length (˜60 bases), with thiol groups at both 5′ ends for coupling to gold contacts on a metal electrode. The embodiment of FIG. 12 comprises a polymerase enzyme coupled to a molecular wire comprised of DNA, which plugs into a nano-electrode pair to form a sensor capable of producing sequence-related signals as the polymerase enzyme processes a primed DNA template.

As illustrated in FIG. 13, such a nano-sensor can be placed by post-processing onto the pixels of a CMOS sensor pixel array, which further comprises all the supporting measurement, readout and control circuitry needed to produce these signals from a large number of sensors operating in parallel. FIG. 13 illustrates an embodiment of various electrical components and connections in molecular sensors. In the upper portion of the figure, a cross-section of an electrode-substrate structure 300 is illustrated, with attachment to an analyzer 301 for applying voltages and measuring currents through the bridge molecule of the sensor. In the lower portion of the figure, a perspective view of electrode array 302 is illustrated, usable for bridging circuits. Each pair of electrodes comprises a first metal (e.g., “Metal-1”), and a contact dot or island of a second metal (e.g., “Metal-2”) at each electrode end near the gap separating the electrodes. In various examples, Metal-1 and Metal-2 may comprise the same metal or different metals. In other aspects, the contact dots are gold (Au) islands atop metal electrodes comprising a different metal. In various experiments, contact dots comprise gold (Au) beads or gold (Au)-coated electrode tips that support self-assembly of a single bridge molecule over each gap between electrode pairs, such as via thiol-gold binding.

FIGS. 14A-14C show drawings of electron micrograph (EM) images of electrodes comprising gold metal dot contacts for bridge binding in DNA sensors. In this example, electrodes are on a silicon substrate, and were produced via e-beam lithography. FIG. 14A shows an array of titanium electrodes with gold dot contacts. FIG. 14B shows an electrode gap of about 7 nm with gold dot contacts and with about a 15 nm gold-to-gold spacing in a closer-up EM image. In FIG. 14C, a close-up EM shows gold dots of approximately 10 nm in size positioned at the tips of the electrodes.

FIG. 15 sets forth current versus time plots obtained by measuring DNA incorporation signals with the sensor of FIG. 12. The plots show the current signals resulting from the sensor being supplied with various primed, single-stranded DNA sequencing templates and dNTPs for incorporation and polymerization. In each case, the major signal spikes represent signals from discrete incorporation events, wherein the polymerase enzyme adds another base to the extending strand. At the upper left of FIG. 15, the template is 20 T bases; at the upper right, the template is 20 G bases; at the lower left, the template is 20 A bases; and at the lower right, the template is 20 C bases. The approximate rate of incorporation observed is about 10-20 bases per second, consistent with standard enzyme kinetics except for the lower rate of ~1 base per second due to rate limiting factors (e.g., lower dNTP concentration).

FIG. 16 shows experimental data obtained from the sensor of FIG. 12 in which specific sequence motifs produced signals that are usable to encode 0/1 binary data. The sensor of FIG. 12 comprises the Klenow polymerase conjugated to a DNA bridge, which produces distinguishable signals from the encoding DNA sequence motifs 20A, 3C and 30A in the experimental template DNA. Such signals were produced by using the sensor of FIG. 12 in conjunction with a standard 1× Klenow buffer and a relatively high concentration of dTTP (10 μM), and a 100 times lower concentration of the other dNTPs. The lower concentration of the other dNTPs, notably the low dGTP concentration, facilitates the distinguishable signal from the CCC region via the concentration-limited rate of incorporation. The result is that the poly-A tract has a high spike signal feature, and the poly-C tract has a low trough signal feature, which are readily distinguishable. The peaks and troughs are usable to encode 0/1 binary data in the simple manner illustrated, with 0 encoded by the poly-A tract and read from the high peak signals having several seconds duration, and 1 encoded by the CCC tract and read from the low trough features having several seconds duration.
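The simple 0/1 mapping just described can be sketched in a few lines of Python. This is only an illustration at the sequence level: the motif lengths (20 A bases for 0, CCC for 1) follow the example of FIG. 16, while the function names `encode_bits` and `decode_template` are hypothetical, and the sketch does not model the electrical signal, the priming site, or any separator that might be needed to tell adjacent identical motifs apart in a real trace.

```python
# Hypothetical motif table following the poly-A / CCC example of FIG. 16.
MOTIF_FOR_BIT = {"0": "A" * 20, "1": "CCC"}


def encode_bits(bits: str) -> str:
    """Concatenate one motif per bit to form the template payload sequence."""
    return "".join(MOTIF_FOR_BIT[b] for b in bits)


def decode_template(seq: str) -> str:
    """Greedily parse a payload sequence back into bits, motif by motif."""
    bits = []
    i = 0
    while i < len(seq):
        for bit, motif in MOTIF_FOR_BIT.items():
            if seq.startswith(motif, i):
                bits.append(bit)
                i += len(motif)
                break
        else:
            # No known motif starts at position i.
            raise ValueError(f"unrecognized motif at position {i}")
    return "".join(bits)
```

In a reader, the decoding step would operate on segmented peak and trough features of the measured current rather than on the base sequence itself.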

The use of the sensors of the present disclosure to measure distinguishable features of a DNA molecule requires that the polymerase be provided with primed, single-stranded template DNA molecules as a substrate for polymerization of a complementary strand, in the course of generating the associated signals. In the context of encoding information in synthetic DNA molecules, these template molecules may be wholly chemically synthetic, and can therefore be provided with chemical or structural modifications or properties beyond those of native DNA, which may be used to enable or enhance the production of distinguishable signals for various embodiments. The polymerase, native or an engineered mutant, can accept as a substrate a great many such modified or analogue forms of DNA, many of which are well known to those skilled in the field of molecular biology. Such modifications to the template DNA can be used to create features with distinguishable signals.

In various embodiments, the DNA supplied to the polymerase as a template comprises some form of primed (double-stranded/single-stranded transition) site to act as an initiation site for the polymerase. For the purpose of storing digital data in DNA, in various embodiments, this priming will be pre-assembled into the encoding molecule, so that no further sample preparation is needed to prime the DNA template molecules.

Since the secondary structure in a DNA template can interfere with the processive action of a polymerase, it may be advantageous to reduce, avoid or eliminate secondary structure in the DNA data encoding template molecules used in DNA data reader sensors. Many methods to reduce secondary structure interference are known to those skilled in the field of molecular biology. Methods to reduce, avoid or eliminate secondary structure include, but are not limited to: using polymerases that possess strong secondary structure displacing capabilities, such as Phi29 or Bst or T7, either native or mutant forms of these; adding to the buffer solvents such as betaine, DMSO, ethylene glycol or 1,2-propanediol; decreasing the salt concentration of the buffer; increasing the temperature of the solution; and adding single strand binding protein or degenerate binding oligonucleotides to hybridize along the single strand. Methods such as these can have the beneficial effect of reducing secondary structure interference with the polymerase processing the encoding DNA and producing proper signals.

Additional methods available to reduce unwanted secondary structure for DNA data reading in accordance with the present disclosure comprise adding properties to DNA molecules produced by synthetic chemistry. For example, in some embodiments of the present disclosure, the data encoding DNA molecule itself can be synthesized from base analogues that reduce secondary structure, such as using deaza-G (7-deaza-2′-deoxyguanosine) in place of G, which weakens G/C base pairing, or by using a locked nucleic acid (LNA) in the strand, which stiffens the backbone to reduce secondary structure. A variety of such analogues with such effects are known to those skilled in the field of nucleic acid chemistry.

Further methods are available in the present disclosure to reduce unwanted secondary structure for the DNA data reading sensor, because the DNA data encoding scheme determines the template sequence, and thus there is potential to choose the encoding scheme to avoid sequences prone to secondary structure. Such Secondary Structure Avoiding (“SSA”) encoding schemes are therefore a beneficial aspect of the present disclosure. In general, for encoding schemes as described herein, which use distinguishable signal sequence features as the encoding elements, to the extent there are options in the choice of encoding schemes, all such alternative schemes could be considered, and the schemes that produce less (or the least) secondary structure would be favored for use. The alternative schemes are assessed relative to a specific digital data payload, or statistically across a representative population of such data payloads to be encoded.

For example, the importance of SSA encoding is illustrated in the embodiment where the sensor provides three distinguishable signal sequence features: AAAAA, TTTTT, and CCCCC. If all three features are used in encoding in the same strand (or on other strands), there is a strong potential for the AAAAA and TTTTT encoding elements, being complementary, to hybridize and lead to secondary structure, either within the strand or between DNA strands. Thus, if the data were instead encoded entirely by the scheme where the bit 0 encodes to AAAAA and the bit 1 encodes to CCCCC (i.e., ignoring the use of TTTTT completely), all potential secondary structure is avoided. Thus, this encoding (or the other SSA choice, the bit 0 encoding to TTTTT and the bit 1 encoding to CCCCC) is preferred over a scheme that uses mutually complementary sequences, even though information density is reduced by giving up one of the three available encoding elements. Thus, in general, SSA codes can be used when there are encoding options and when there is a potential for DNA secondary structure to form. As shown in this example, desirable SSA codes to reduce DNA secondary structure may be less information dense than what is theoretically possible for the distinguishable signal states. However, this tradeoff can result in a net gain of information density, or related overall cost or speed improvements, by avoiding data loss related to DNA secondary structure.
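The selection of an SSA code from a set of candidate features can be illustrated with a minimal Python sketch. It assumes a deliberately crude criterion, namely that two features are incompatible only when one is the exact reverse complement of the other; a practical assessment would rely on thermodynamic secondary-structure prediction over whole payloads. The helper names `reverse_complement` and `ssa_subsets` are hypothetical.

```python
from itertools import combinations

# Translation table for Watson-Crick complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def ssa_subsets(features, size):
    """List feature subsets of the given size in which no feature is the
    reverse complement of any feature in the subset (itself included)."""
    allowed = []
    for subset in combinations(features, size):
        clash = any(a == reverse_complement(b) for a in subset for b in subset)
        if not clash:
            allowed.append(subset)
    return allowed


features = ["AAAAA", "TTTTT", "CCCCC"]
two_feature_codes = ssa_subsets(features, 2)
three_feature_codes = ssa_subsets(features, 3)
```

Under this criterion the two-feature codes that remain are (AAAAA, CCCCC) and (TTTTT, CCCCC), matching the two SSA choices named above, and no three-feature code survives.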

In various embodiments, methods for reducing secondary structure comprise the use of binding oligonucleotides to protect the single strand, wherein the oligonucleotides are chosen with sequence or sequence composition that will preferentially bind to the encoding features. Such binding oligonucleotides may more effectively protect the single strand than general degenerate oligonucleotides. For example, in the case described above with three distinguishable signal sequence features AAAAA, TTTTT, and CCCCC, all three could be used as encoding features, and they could be protected in single-stranded form by binding the template to the oligonucleotides TTTTT, AAAAA, and GGGGG, or to enhanced binding analogues of these, such as RNA, LNA or PNA forms, instead of DNA. Thus, use of binding oligonucleotides that preferentially bind to the encoding features is another means to mitigate unwanted secondary structure effects, although such binding oligonucleotides must be used with strand-displacing polymerases, such as native or mutant forms of Klenow, Bst or Phi29, such that the oligonucleotides themselves do not interfere. A further method for avoiding secondary structure is to prepare the information encoding DNA in primarily double-stranded form, with a nick or gap at the primer site for polymerase initiation, and the rest of the molecule in duplex form (with or without a hairpin bend) so that the DNA molecule exists in solution in a substantially duplex form, free of secondary structure due to single-strand interactions, within or between molecules.

In various embodiments, DNA molecules used to encode information for reading by the cognate molecular sensor can be prepared with architecture facilitating the reading process as well as the encoding and decoding processes. Various embodiments of DNA architecture are illustrated in FIG. 17. Illustrated is a representative physical form of a primed single-stranded DNA template (at the top of the drawing), along with the logical forms of an information encoding molecule for use in a digital data storage system. Exemplary forms may include Left and Right Adapters (shown as “L ADAPTOR” and “R ADAPTOR”), to facilitate manipulation of the information coding DNA molecules, a primer (e.g., pre-primed or self-priming, shown as “PRIMER”), left and right buffer segments (shown as “L-BUFFER” and “R-BUFFER”) and a data payload segment (“DATA PAYLOAD”).
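One way to picture such a logical form is as a record of named segments that are concatenated into a single strand. The Python sketch below is illustrative only: the class name `EncodingMolecule`, the left-to-right segment order, and the short placeholder sequences are assumptions made for the example and are not taken from FIG. 17.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingMolecule:
    """Hypothetical logical layout of an information encoding DNA molecule."""
    l_adapter: str
    primer: str
    l_buffer: str
    data_payload: str
    r_buffer: str
    r_adapter: str

    def sequence(self) -> str:
        # Segment order here is an assumption made for illustration.
        return "".join((
            self.l_adapter, self.primer, self.l_buffer,
            self.data_payload, self.r_buffer, self.r_adapter,
        ))


# Placeholder segment sequences, chosen arbitrarily for the example.
molecule = EncodingMolecule(
    l_adapter="GATC", primer="ACGT", l_buffer="TT",
    data_payload="AAAAACCCCC", r_buffer="TT", r_adapter="CTAG",
)
full_sequence = molecule.sequence()
```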

With continued reference to FIG. 17, the adapters may comprise, for example, primers for universal amplification processes, used to copy the stored data, or may comprise hybridization capture sites or other selective binding targets, for targeted selection of molecules from a pool. In various embodiments, a primer segment contains primer target/structure. The L-BUFFER segment may contain a signal calibration sequence for the reader, or a buffering sequence prior to the DATA PAYLOAD segment, which contains information storing encoded sequence and related error correction sequence such as parity bits, as well as metadata for the storage method, such as related to the assembly of this information into larger strings. In various aspects, the R-BUFFER may contain an additional calibration sequence, as well as a buffer sequence preventing the polymerase enzyme getting too close to the end of the template when reading data. In various embodiments, the L-ADAPTER and R-ADAPTER may be sequence elements related to the storage or manipulation of the associated DNA segment, such as adapters for outer priming sites for PCR amplification, or hybridization based selection, or representing a surrounding carrier DNA for this insert, including insertion into a host organism genome as a carrier. In various embodiments, the adapters may comprise surrounding or carrier DNA, for example in the case of DNA data molecules stored in live host genomes, such as in bacterial plasmids or other genome components of living organisms.

With further reference to FIG. 17, the L-BUFFER and R-BUFFER segments may comprise DNA segments that support the polymerase binding footprint, or the segments may comprise various calibration or initiation sequences used to help interpret the signals coming from the data payload region. These buffer segments may contain molecular barcode sequences that are used to distinguish unique molecules, or to identify replicate molecules that are derived from the same originating single molecule. One such method of barcoding, known to those skilled in DNA oligo synthesis, comprises the addition of a short random N-mer sequence, typically 1 to 20 bases long, made for example by carrying out synthesis steps with degenerate mixtures of bases instead of specific bases.
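The capacity of a random N-mer barcode can be illustrated with a short Python sketch. It simulates a degenerate synthesis step as a uniform random draw over the four bases, which is an idealizing assumption (real degenerate mixtures need not be uniform), and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def random_barcode(n: int, rng: random.Random) -> str:
    """Draw a random N-mer, one uniformly chosen base per position."""
    return "".join(rng.choice(BASES) for _ in range(n))


def barcode_space(n: int) -> int:
    """Number of distinct N-mers available as barcodes."""
    return 4 ** n


rng = random.Random(0)  # seeded only so that the example is repeatable
example_barcode = random_barcode(10, rng)
distinct_10mers = barcode_space(10)
```

A 10-mer gives 4^10 = 1,048,576 possible barcodes and a 20-mer about 1.1 × 10^12, so longer N-mers make it less likely that two unrelated molecules carry the same barcode by chance.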

With continued reference to FIG. 17, DNA logical structures comprise a data payload segment wherein specific data is encoded. In various embodiments, a data payload segment comprises the actual primary digital data being stored along with metadata for the storage method, which may comprise data related to proper assembly of such information fragments into longer strings, and/or data related to error detection and correction, such as parity bits, check sums, or other such information overhead.
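As a minimal illustration of such error detection overhead, the Python sketch below appends a single even-parity bit to a block of payload bits before encoding and checks it after decoding. A single parity bit is chosen only because it is the simplest case; it detects an odd number of bit errors and corrects none, and the function names are hypothetical.

```python
def add_even_parity(bits: str) -> str:
    """Append one bit so that the total number of 1s in the word is even."""
    return bits + str(bits.count("1") % 2)


def check_even_parity(word: str) -> bool:
    """Return True when the word (payload plus parity bit) has even parity."""
    return word.count("1") % 2 == 0


protected = add_even_parity("1011")      # three 1s, so the parity bit is 1
intact = check_even_parity(protected)    # parity holds for the unmodified word
corrupted = check_even_parity("10110")   # last bit flipped, parity fails
```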

In various aspects of the present disclosure, a DNA data payload of interest is processed by a polymerase sensor multiple times to provide a more robust recovery of digital data from DNA storage. In other aspects, a collection of such payloads on average are processed some expected number of multiple times. These examples benefit from a more accurate estimation of the encoding distinguishable features by aggregating the multiple observations. Multiple processing also has the benefit of overcoming fundamental Poisson sampling statistical variability to ensure that, with high confidence, a data payload of interest is sampled and observed at least once, or at least some desirable minimal number of times.
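The Poisson sampling argument can be made concrete with a small Python calculation. It assumes that the number of times a given payload is observed follows a Poisson distribution whose mean is the average number of reads per payload; the function names are hypothetical.

```python
import math


def prob_at_least(min_reads: int, mean_reads: float) -> float:
    """P(K >= min_reads) for K ~ Poisson(mean_reads)."""
    p_fewer = sum(
        math.exp(-mean_reads) * mean_reads ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


def mean_reads_for_confidence(confidence: float) -> float:
    """Mean reads per payload needed to observe it at least once
    with the given probability, from 1 - exp(-mean) = confidence."""
    return -math.log(1.0 - confidence)


p_seen_once_at_mean_1 = prob_at_least(1, 1.0)
p_seen_once_at_mean_10 = prob_at_least(1, 10.0)
mean_for_99_percent = mean_reads_for_confidence(0.99)
```

Under this model, an average of one read per payload leaves roughly a 37% chance that a given payload is never observed, whereas an average of about 4.6 reads brings the chance of missing it down to 1%.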

In various embodiments, the number of such repeat interrogations is in the range of 1 to about 1000 times, or in the range of about 10 to 100 times. Such multiple observations may comprise: (i) observations of the same physical DNA molecule by the polymerase sensor, and/or (ii) one or more polymerase sensors processing multiple, physically distinct DNA molecules that carry the same data payload. In the latter case, such multiple, physically distinct DNA molecules with the same data payload may be the DNA molecules produced by the same bulk synthesis reaction, the molecules obtained from distinct synthesis reactions targeting the same data payload, or replicate molecules produced by applying amplification or replication methods such as PCR, T7 amplification, rolling circle amplification, or other forms of replication known to those skilled in molecular biology. The aggregation of such multiple observations may be done through many methods, such as averaging or voting, maximum likelihood estimation, Bayesian estimation, hidden Markov methods, graph theoretic or optimization methods, or deep learning neural network methods.
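Of the aggregation methods listed, voting is the simplest to illustrate. The Python sketch below takes several already-decoded reads of the same payload, assumed here to be aligned and of equal length, and returns the per-position majority. The function name `majority_vote` is hypothetical, and ties are resolved by whichever symbol was seen first, which is an arbitrary choice for this sketch.

```python
from collections import Counter


def majority_vote(reads: list[str]) -> str:
    """Return the per-position consensus of equal-length decoded reads."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    # zip(*reads) yields one column of symbols per payload position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


consensus = majority_vote(["0110", "0100", "0110"])
```

Here the single error in the second read is outvoted by the other two reads at that position.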

In various embodiments of the present disclosure, digital data stored in DNA is read at a high rate, such as approaching 1 Gigabyte per second for the recovery of digital data, as is possible with large scale magnetic tape storage systems. Because the maximum processing speed of a polymerase enzyme is in the range of 100-1000 bases per second, depending on the type, the bit recovery rate of a polymerase-based sensor is limited to a comparable speed. Thus, in various embodiments millions of sensors are deployed in a cost effective format to achieve the desired data reading capacity.
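The need for millions of sensors follows from simple arithmetic, sketched below in Python. The polymerase speeds and the 1 Gigabyte per second target are taken from the text; the number of bases spent per stored bit is a hypothetical parameter, and 1 Gigabyte is taken as 10^9 bytes.

```python
import math


def sensors_needed(target_bytes_per_s: float,
                   bases_per_s: float,
                   bases_per_bit: float) -> int:
    """Sensors required if every sensor reads continuously at full speed."""
    bits_per_s_per_sensor = bases_per_s / bases_per_bit
    return math.ceil(target_bytes_per_s * 8 / bits_per_s_per_sensor)


# Optimistic case: 1000 bases per second and one base per bit (assumed).
best_case = sensors_needed(1e9, 1000, 1)
# Slower enzyme and a 5-base motif per bit (assumed).
slower_case = sensors_needed(1e9, 100, 5)
```

Even the optimistic case calls for 8 million sensors operating in parallel, and the slower case for 400 million, before allowing for repeat reads or inactive sensors.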

In various embodiments, many individual molecular sensors are deployed in a large scale sensor array on a CMOS sensor pixel array chip, CMOS being the most cost-effective semiconductor chip manufacturing process. FIG. 18 illustrates an embodiment of a fabrication stack usable to create a massively parallel array of molecular sensors on a chip. In this example, the sensor measurement circuitry is deployed as a scalable pixel array as a CMOS chip, a nano-scale lithography process is used to fabricate the nano-electrodes, and molecular self-assembly chemical reactions, in solution, are used to establish the molecular complex on each nano-electrode in the sensor array. The result of this fabrication stack is the finished DNA reader sensor array chip indicated at the bottom of FIG. 18. In various embodiments, the nanoscale lithography is done using a high resolution CMOS node, such as a 28 nm, 22 nm, 20 nm, 16 nm, 14 nm, 10 nm, 7 nm or 5 nm node, to leverage the economics of CMOS chip manufacturing. In contrast, the pixel electronics may be done at a coarser node better suited to mixed signal devices, such as 180 nm, 130 nm, 90 nm, 65 nm, 40 nm, 32 nm or 28 nm. Alternatively, the nano-electrodes may be fabricated by any one of a variety of other fabrication methods known to those skilled in the art of nanofabrication, such as e-beam lithography, nano-imprint lithography, ion beam lithography, or advanced methods of photolithography, such as any combinations of Extreme UV or Deep UV lithography, multiple patterning, or phase shifting masks.

FIG. 19 illustrates an embodiment of a high-level CMOS chip pixel array architecture for a DNA reader in more detail at the left of the drawing figure. The CMOS chip pixel array architecture comprises a scalable array of sensor pixels, with associated power and control circuitry and major blocks such as Bias, Analog-to-Digital converters, and timing. The inset in the figure shows an individual sensor pixel as a small bridged structure representing a single polymerase molecular sensor, and where this individual electronic sensor is located in the pixel array. FIG. 19 also illustrates (at the right side of the figure) the details of an embodiment of a polymerase molecular electronics sensor circuit pixel in the array. As illustrated, a complete sensor circuit comprises a trans-impedance amplifier, voltage-biasable source, drain, and (optionally) gate electrodes, and a reset switch, along with a polymerase enzyme electrically connected between the source and drain electrodes (with or without bridge and/or arm molecules). The feedback capacitor illustrated is optional to improve stability of the amplifier. The output of the pixel circuit (the measurable electronic parameter) in this embodiment is current, which is monitored for perturbations relating to the activity of the polymerase. That is, the current output from the trans-impedance amplifier is the measurable electrical parameter for this sensor pixel that is monitored for perturbations. It should be noted that one of the two electrodes can be grounded, in which case a biasable voltage is supplied across the electrodes.

FIG. 20 illustrates an embodiment of an annotated chip design layout file and the corresponding finished chip for comparison. In FIG. 20, (A), at left, is the finished design of an embodiment of the CMOS pixel array of FIG. 19 with 256 pixels, annotated to show the location of the Bias 190, Array 191 and Decoder 192 regions of the chip. The design layout also comprises a test structures 193 region. In FIG. 20, (B), at right, is a drawing of an optical microscope image of the corresponding finished chip based on the final design, produced at TSMC, Inc. semiconductor foundry (San Jose, Calif.) with the TSMC 180 nm CMOS process, with no passivation layer.

FIG. 21 shows illustrations of scanning electron microscope (SEM) images of the finished CMOS chip 200 of FIG. 20 (256 pixel array, 2 mm×2 mm), which clearly shows the sub-optical surface features of the 80 μm pixel 201, and notably the exposed vias (the source, gate, and drain) where the nano-electrodes can be deposited by post-processing and electrically connected into the amplifier circuit as shown in FIG. 19, at right. The furthest right drawing of a 100 nm SEM image 202 in FIG. 21 shows an e-beam lithography fabricated pair of spaced apart nanoelectrodes with a molecular complex in place. The sketch 203 at the bottom right of FIG. 21 is an illustration of the processive enzyme molecular electronics sensor comprising a polymerase molecular complex 207, spaced apart electrodes 204 and 205, each labeled by a gold dot contact, wherein the electrode gap 206 is about 10 nm.

In various embodiments of a DNA reader device, use of a CMOS chip device in conjunction with nano-scale manufacturing technologies ultimately yields a much lower cost, high throughput, fast, and scalable system. For example, sensors such as this can process DNA templates at the rate of 10 or more bases per second, 100 or more times faster than current optical sequencers. The use of CMOS chip technology ensures scalability and low system cost in a mass-producible format that leverages the enormous infrastructure of the semiconductor industry. As noted, whatever error modes or accuracy limitations may exist in a DNA sensor, or that may arise at faster reading speed (e.g., by modifying the enzyme or buffer or temperature or environmental factors, or sampling data at lower time resolution), can be compensated for in the overall encoder/decoder-reader-writer framework described.

In various embodiments of the present disclosure, a DNA reader chip for use herein comprises at least 1 million sensors, at least 10 million sensors, at least 100 million sensors, or at least 1 billion sensors. At a typical sensor data sampling rate of 10 kHz, and recording 1 byte per measurement, a 100 million sensor chip produces raw signal data at a rate of 1 Terabyte (TB) per second. In considering how many sensors are desirable on a single chip, one critical consideration is the rate at which such a chip can decode digital data stored in DNA compared to the desirable digital data reading rates. It is, for example, desirable to have digital data read out at a rate of up to about 1 Gigabyte per second. Note that each bit of digital data encoded as DNA will require multiple signal measurements to recover, given that a feature of the signal is used to store this information, so the raw signal data production rate for the measured signal will be much higher than the recovery rate of encoded digital data. For example, if 10 signal measurements are required to recover 1 bit of stored digital data, and each measurement is an 8-bit byte, that is a factor of 80 bits of signal data to recover 1 bit of stored digital data. Thus, digital data reading rates are anticipated to be on the order of 100 times slower than the sensor raw signal data acquisition rate. For this reason, achieving a desirable digital data reading rate of 1 Gigabyte/sec would require nearly 0.1 TB/sec of usable raw signal data. Further, given that not all the sensors in a single chip may be producing usable data, chips that produce up to 1 TB/sec of raw data are desirable, based on the desired ultimate digital data recovery rates from data stored as DNA. In various embodiments, such recovery rates correspond to a 100 million sensor pixel chip.
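The data-rate figures in this paragraph can be checked with a few lines of Python. All numeric inputs come from the text (100 million sensors, 10 kHz sampling, 1 byte per measurement, 10 measurements per stored bit); the function names are hypothetical, and decimal units are assumed (1 Gigabyte = 10^9 bytes, 1 TB = 10^12 bytes).

```python
def raw_rate_bytes_per_s(n_sensors: int, sample_hz: int,
                         bytes_per_sample: int) -> int:
    """Raw signal bytes produced per second by the whole chip."""
    return n_sensors * sample_hz * bytes_per_sample


def signal_bits_per_data_bit(measurements_per_bit: int,
                             bits_per_measurement: int) -> int:
    """Bits of raw signal consumed to recover one stored bit."""
    return measurements_per_bit * bits_per_measurement


def required_raw_rate(target_bytes_per_s: int, overhead: int) -> int:
    """Usable raw signal bytes per second needed for a target read rate."""
    return target_bytes_per_s * overhead


chip_raw_rate = raw_rate_bytes_per_s(100_000_000, 10_000, 1)  # 10**12 B/s
overhead = signal_bits_per_data_bit(10, 8)                    # 80
needed_raw_rate = required_raw_rate(10**9, overhead)          # 8 * 10**10 B/s
```

The chip therefore produces 1 TB/sec of raw signal, while a 1 Gigabyte/sec read rate needs 0.08 TB/sec of usable signal, which the text rounds to nearly 0.1 TB/sec.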

In various embodiments of the present disclosure, multiple chips are deployed within a reader system to achieve desired system-level digital data reading rates. The DNA data reader chip of FIG. 18 is, in various embodiments, deployed as part of a complete system for reading digital data stored in DNA.

The features of an embodiment of a complete system are illustrated in FIG. 22. In various aspects, and with reference to FIG. 22, a complete digital data reading system comprises a motherboard with a staging area for an array of multiple chips, in order to provide data reading throughput beyond the limitations of a single chip. Such chips are individually housed in flow cells, with a fluidics liquid handling system that controls the addition and removal of the sensor system liquid reagents. In addition, the fluidic system receives DNA encoding data in solution form, originating from a data repository source. The motherboard would also comprise a suitable first stage data processing unit, capable of receiving and reducing raw signal data at very high rates, such as exceeding 1 TB per second, exceeding 10 TB per second, or exceeding 100 TB per second, indicated as a primary signal processor. This primary processor may comprise one, multiple, or combinations of an FPGA, GPU, or DSP device, or a custom signal processing chip, and this may optionally be followed by stages of similar such signal processors, for a processing pipeline. Data output of this primary pipeline is typically transferred to a fast data storage buffer, such as a solid state drive, with data from here undergoing further processing or decoding in a CPU-based sub-system, from which data is buffered into a lower speed mass storage buffer, such as a hard drive or solid state drive or array of such drives. From there it is transferred to an auxiliary data transfer computer sub-system that handles the subsequent transfer of decoded data to a destination. All these system operations are under the high-level control of an auxiliary control computer that monitors, coordinates and controls the interplay of these functional units and processes.

In some embodiments, chips within the reader system may be disposable, and replaced after a certain duty cycle, such as 24 hours to 48 hours. In other embodiments, the chips may be reconditioned in place after such a usage period, whereby the molecular complex, and possibly conjugating groups, are removed, and then replaced with new such components through a series of chemical solution exposures. The removal process may comprise using voltages applied to the electrodes to drive removal, such as an elevated voltage applied to the electrodes, or an alternating voltage applied to the electrodes, or a voltage sweep. The process may also comprise the use of chemicals that denature, dissolve or dissociate or otherwise eliminate such groups, such as high molarity urea, or guanidine or other chaotropic salts, proteases such as Proteinase K, acids such as HCl, bases such as KOH or NaOH, or other agents well known in molecular biology and biochemistry for such purposes. This process may also include the use of applied temperature or light to drive the removal, such as elevated temperature or light in conjunction with photo-cleavable groups in the molecular complex or conjugation groups.

FIG. 23 illustrates an embodiment of a cloud based DNA data archival storage system, in which the complete reader system such as outlined in FIG. 22 is deployed in aggregated format to provide the cloud DNA reader server of the overall archival storage and retrieval system. FIG. 23 shows a cloud computing system, with a standard storage format (upper left). Such a standard cloud computing system is provided with DNA archival data storage capability as indicated. A cloud-based DNA synthesis server can accept binary data from the cloud computer, and produce the physical data encoding DNA molecules. This server stores the output molecules in a DNA data storage archive, lower right, where typically the physical DNA molecules that encode data could be stored in dried or lyophilized format, or in solution, at ambient temperature or cooled or frozen. From this archive, when data is to be retrieved, a DNA sample from the archive is provided to the DNA data reader server, which outputs decoded binary data back to the primary cloud computer system. This DNA data reader server may be powered by a multiplicity of DNA reader chip-based systems of the kind indicated in FIG. 22, in combination with additional computers that perform the final decoding of the DNA derived data back to the original data format of the primary cloud storage system.

FIGS. 24, 25 and 26 illustrate the related use of other single-molecule DNA readers that may support amplification-free reading, and that comprise a polymerase (or other processive enzyme) in the sensor. FIG. 26 illustrates a zero mode wave guide polymerase optical sensor, in cross section.

In various embodiments, a molecular electronics sensor comprises the configuration illustrated in FIG. 24. In this case, fundamental electronic measurements are made by a nanopore ionic current sensor that consists of electrodes on either side of a membrane, a pore localized in the membrane, and an aqueous solution phase residing on both sides of the pore. In this embodiment, the pore regulates the passage of ionic current (indicated by the dashed arrow and the “i”). The pore may comprise a biological protein nanopore, native or mutated, and the membrane may comprise a lipid membrane, or synthetic analogue thereof. The pore may also comprise a solid state pore, with the membrane comprising a thinned membrane composed of a solid material such as SiN or Teflon. The pore may have electrodes of the same polarity, or, as illustrated, opposite polarity. As shown in FIG. 24, the polymerase molecule is further complexed with the pore, as part of a molecular complex involving a small number of molecules embedded through the membrane as part of the pore and to provide a conjugation to the polymerase. As the polymerase processes a DNA template, the ionic current through the pore is modulated by this activity, producing distinguishable signal features that correspond to distinct sequence features. Aside from a different geometry of the nano-electrical measurement, the considerations are otherwise identical to those already reviewed herein. That is, nano-pore current sensor versions of the polymerase-based DNA digital data reader are of similar use herein. In various embodiments, the polymerase is directly and specifically conjugated to the pore, and modified dNTPs are used to produce distinguishable signals from DNA sequence features. For producing signals in a nanopore sensor, such dNTP modifications may comprise groups on the γ-phosphate of the dNTP, which can occlude the pore while the dNTP is undergoing incorporation by the polymerase, thereby resulting in current suppression features. In various embodiments, such modifications comprise extending the tri-phosphate chain to 4, 5, 6 or up to 12 phosphates, and adding terminal phosphate groups, or groups to any of the phosphates at position 2 or more, which are removed by polymerase incorporation, such groups including polymers that may occlude the pore by entering the pore, such as comprising PEG polymers or DNA polymers. The polymerase conjugation to the pore may comprise any one of possible conjugation chemistries, such as a molecular tether, or Spy-SpyCatcher protein-based conjugation system, or the like.

In various embodiments, a molecular electronic sensor for reading DNA comprises a carbon nanotube. As illustrated in FIG. 25, the bridge molecule comprises a carbon nanotube (represented by the bold horizontal bar in FIG. 25 bridging the gap between positive and negative electrodes). In various aspects, the carbon nanotube bridge comprises a single or multi-walled carbon nanotube, and is conjugated to the polymerase molecule at a specific site using any of many possible conjugation chemistries. Such a conjugation may, for example, comprise a pyrene linker to attach to the nanotube via π-stacking of the pyrene on the nanotube, or may comprise attachment to a defect site residing in the carbon nanotube. In this case, the current passing through a carbon nanotube molecular wire is known to be highly sensitive to other molecules in the surrounding environment. It is further known that current passing through a carbon nanotube is sensitive to the activity of an enzyme molecule properly conjugated to that nanotube, including polymerase enzymes. For this particular embodiment, all the aspects of the present disclosure put forth above apply in this instance, to provide a carbon nanotube based sensor for reading digital data stored in DNA molecules, including the related beneficial aspects, encoding schemes, chip formats, systems and cloud based DNA digital data storage systems.

An alternative sensor that produces optical signals is a Zero Mode Waveguide sensor, such as the sensor illustrated in FIG. 26. Such a sensor may comprise a single polymerase as shown, conjugated to the bottom of the metallic well, in the evanescent zone of the excitation field applied to the thin substrate, in a Total Internal Reflection mode. The polymerase is provided with primed template and dNTPs with dye labels on the cleavable phosphate group. When such a dNTP is incorporated, the dye label is held in the evanescent field, and is stimulated to emit photons of the corresponding dye energy spectrum or color. The result is that, under appropriate conditions, such a sensor may produce distinguishable optical signals as indicated, which can be used to encode digital information into DNA molecules. The distinguishable signals here may be photon emissions of a different energy distribution, or color, or emissions with different distinguishable spectra, or different duration or intensity or shape of the spectra versus time, or any combination of such elements that result in distinguishable features. For this Zero Mode Waveguide sensor embodiment indicated in FIG. 26, all the aspects of the disclosure put forth above in the various embodiments also apply in this instance, to provide a Zero Mode Waveguide-based sensor for reading digital data stored in DNA molecules, and the related beneficial aspects, encoding schemes, chip formats (in this case, optical sensor chips, such as image sensor chips), systems and cloud based DNA digital data storage systems may apply to such a sensor.

Further Embodiments

In various embodiments, an amplification-free DNA archival storage system comprises: (i) an amplification-free subsystem for writing DNA data molecules; (ii) an amplification-free subsystem for managing the DNA data molecules; and (iii) an amplification-free subsystem for reading DNA data molecules.

In various embodiments, a reduced-amplification DNA archival storage system comprises an amplification-free subsystem for writing DNA data molecules.

In various embodiments, a reduced-amplification DNA archival storage system comprises an amplification-free subsystem for managing the DNA data molecules.

In various embodiments, a reduced-amplification DNA archival storage system comprises an amplification-free subsystem for reading the DNA data molecules.

In various embodiments, the amplification-free subsystem for writing DNA data molecules comprises the use of isolated, localized, phosphoramidite synthesis reactions.

In various embodiments, the amplification-free subsystem for managing DNA data molecules comprises the taking of aliquots for copying without amplification.

In various embodiments, the amplification-free subsystem for managing DNA data molecules comprises the use of hybridization for selecting volumes or searching for data, without amplification.

In various embodiments, the amplification-free subsystem for reading DNA data molecules comprises the use of a single molecule DNA sequencing system.

In various embodiments, the amplification-free subsystem for reading DNA data molecules comprises the use of a molecular electronic sensor that performs single molecule analysis of DNA.

In various embodiments, the amplification-free subsystem for reading DNA data molecules comprises the use of a molecular electronic sensor that comprises a polymerase, and performs single molecule analysis of DNA.

In various embodiments, the amplification-free subsystem for reading DNA data molecules comprises the use of a plurality of molecular electronic sensors deployed as a sensor array on a CMOS sensor pixel chip.

In various embodiments, a cloud based DNA data storage information system comprises any of the above amplification-free or reduced amplification subsystems.

In various embodiments, a method for retrieving data in an amplification-free DNA data storage and retrieval system comprises: a. obtaining a sample from the DNA molecular storage archive; and, b. reading the DNA data from the sample with an amplification-free reader.

In various embodiments, a method of amplification-free DNA data storage comprises: a. writing DNA data with an amplification-free method; b. manipulating the archive with amplification-free methods; and c. reading the DNA data with an amplification-free DNA reader.

In various embodiments, these methods above are performed using cloud-based systems.

In various embodiments, an apparatus for retrieving data in an amplification-free DNA data storage system comprises an amplification-free DNA reader device for reading the data encoded in a DNA molecule.

In various embodiments, an apparatus for amplification-free DNA data storage comprises: a. apparatus for writing DNA data with an amplification-free method; b. apparatus for manipulating the archive with amplification-free methods; and c. apparatus for reading the DNA data with an amplification-free reader.

Amplification-free DNA information storage methods, apparatus and systems are provided. References to “various embodiments”, “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. After reading the description, it will be apparent to one skilled in the relevant art(s) how to implement the disclosure in alternative embodiments.

Benefits, other advantages, and solutions to problems have been described with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any elements that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of the disclosure. The scope of the disclosure is accordingly to be limited by nothing other than the appended claims, in which reference to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more.” Moreover, where a phrase similar to ‘at least one of A, B, and C’ or ‘at least one of A, B, or C’ is used in the claims or specification, it is intended that the phrase be interpreted to mean that A alone may be present in an embodiment, B alone may be present in an embodiment, C alone may be present in an embodiment, or that any combination of the elements A, B and C may be present in a single embodiment; for example, A and B, A and C, B and C, or A and B and C.

All structural, chemical, and functional equivalents to the elements of the above-described various embodiments that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the present claims. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the present disclosure, for it to be encompassed by the present claims. Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. No claim element is intended to invoke 35 U.S.C. 112(f) unless the element is expressly recited using the phrase “means for.” As used herein, the terms “comprises”, “comprising”, or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a molecule, composition, process, method, or device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such molecules, compositions, processes, methods, or devices.

We claim:
 1. A method of archiving information, the method comprising: converting the information into one or more nucleotides using an encoding scheme, the nucleotides predetermined to generate distinguishable signals relating to the information in a measurable electrical parameter of a molecular electronics sensor; assembling the one or more nucleotides into a nucleotide sequence; and synthesizing a pool of replicate DNA molecules without amplification of the DNA molecules, wherein each replicate DNA molecule incorporates the nucleotide sequence.
 2. The method of claim 1, wherein the information comprises a string of binary data.
 3. The method of claim 2, wherein the encoding scheme converts one or more 0/1 bits of binary data within the string of binary data into a sequence motif comprising more than one nucleotide.
 4. The method of claim 3, wherein the step of converting the information comprises dividing the string of binary data into segments, wherein each segment encodes one sequence motif.
 5. The method of claim 4, wherein the binary data bit 0 encodes a homopolymer of A, and the binary data bit 1 encodes a homopolymer of C.
 6. The method of claim 1, wherein at least one of the one or more nucleotides comprises a modified nucleotide.
 7. The method of claim 1, wherein the one or more nucleotides comprise nucleotides that are resistant to secondary structure formation in the replicate DNA molecules compared to a variant of the same nucleotides.
 8. The method of claim 1, wherein the encoding scheme comprises any one or combination of BES1, BES2, BES3, BES4, BES5 and BES6.
 9. The method of claim 1, further comprising: exposing at least one of the replicate DNA molecules to the molecular electronics sensor without prior amplification of the DNA molecules; generating the distinguishable signals; and converting the distinguishable signals into the information, wherein the molecular electronics sensor comprises a pair of spaced-apart electrodes and a molecular sensor complex attached to each electrode to form a sensor circuit, wherein the molecular sensor complex comprises a bridge molecule electrically wired to each electrode in the pair of spaced-apart electrodes and a probe molecule conjugated to the bridge molecule.
 10. The method of claim 9, wherein the step of exposing at least one of the replicate DNA molecules to the molecular electronics sensor comprises suspending the pool of DNA molecules in a buffer, taking an aliquot of the buffer, and providing the aliquot to the sensor.
 11. The method of claim 10, wherein the buffer solution comprises modified dNTPs.
 12. The method of claim 9, wherein the measurable electrical parameter of the sensor comprises a source-drain current between the spaced-apart electrodes and through the molecular sensor complex.
 13. The method of claim 9, wherein the probe molecule comprises a polymerase and wherein the measurable electrical parameter of the sensor is modulated by enzymatic activity of the polymerase while processing any one of the replicate DNA molecules.
 14. The method of claim 13, wherein the polymerase comprises the Klenow Fragment of E. coli Polymerase I, and wherein the bridge molecule comprises a double-stranded DNA molecule.
 15. A method of archiving and retrieving a string of binary data in an amplification-free DNA information storage and retrieval system, the method comprising: dividing the string of binary data into segments of at least one binary bit; assigning each segment to a sequence motif, each sequence motif comprising at least two nucleotides, the sequence motifs predetermined to generate distinguishable signals in a measurable electrical parameter of a molecular electronics sensor; assembling the sequence motifs into a nucleotide sequence; synthesizing a pool of replicate DNA molecules using an amplification-free DNA writing method on a solid support, each replicate DNA molecule incorporating the nucleotide sequence; suspending the pool of DNA molecules in a buffer; taking an aliquot of the buffer; providing the aliquot to the sensor without prior amplification of the DNA molecules; generating the distinguishable signals; and converting the distinguishable signals into the string of binary data, wherein the sensor comprises a pair of spaced-apart electrodes and a molecular sensor complex attached to each electrode to form a molecular electronics circuit, wherein the molecular sensor complex comprises a bridge molecule electrically wired to each electrode in the pair of spaced-apart electrodes and a probe molecule conjugated to the bridge molecule.
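
By way of non-limiting illustration only, the data-handling steps recited above (dividing binary data into one-bit segments, assigning each segment a sequence motif, assembling the motifs, and converting a read back into binary data) may be sketched as follows. The mapping of bit 0 to a homopolymer of A and bit 1 to a homopolymer of C follows the homopolymer encoding recited above; the motif length of 4 and the names MOTIF_LENGTH, MOTIFS, encode_bits and decode_sequence are assumptions made for this example. The decoder assumes an idealized, error-free read of the nucleotide sequence rather than raw sensor signals.

```python
MOTIF_LENGTH = 4  # assumed homopolymer length; not specified in the text
MOTIFS = {"0": "A" * MOTIF_LENGTH, "1": "C" * MOTIF_LENGTH}
BITS_BY_MOTIF = {motif: bit for bit, motif in MOTIFS.items()}


def encode_bits(bits: str) -> str:
    """Map each binary digit to its motif and assemble one nucleotide sequence."""
    invalid = set(bits) - {"0", "1"}
    if invalid:
        raise ValueError(f"not binary data: {sorted(invalid)}")
    return "".join(MOTIFS[bit] for bit in bits)


def decode_sequence(sequence: str) -> str:
    """Split an error-free read into fixed-length motifs and recover the bits."""
    if len(sequence) % MOTIF_LENGTH != 0:
        raise ValueError("sequence length is not a multiple of the motif length")
    bits = []
    for start in range(0, len(sequence), MOTIF_LENGTH):
        motif = sequence[start:start + MOTIF_LENGTH]
        if motif not in BITS_BY_MOTIF:
            raise ValueError(f"unrecognized motif at position {start}: {motif}")
        bits.append(BITS_BY_MOTIF[motif])
    return "".join(bits)
```

For example, encode_bits("0110") yields "AAAACCCCCCCCAAAA", and decode_sequence applied to that sequence returns "0110". Fixed-length motifs are used here so that runs of identical bits remain separable; other segmentations and motif sets are equally possible.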